Abstract

Shrinkage estimators of covariance are an important tool in modern applied and theoretical statistics.
They play a key role in regularized estimation problems, such as ridge regression (aka Tikhonov
regularization), regularized discriminant analysis and a variety of optimization problems.

In this paper, we bring to bear the tools of random matrix theory to understand their behavior,
and in particular, that of quadratic forms involving inverses of those estimators, which are important in
practice.

We use very mild assumptions compared to the usual assumptions made in random matrix theory,
requiring only mild conditions on the moments of linear and quadratic forms in our random vectors.
In particular, we show that our results apply for instance to log-normal data, which are of interest in
financial applications.

Our study highlights the relative sensitivity of random matrix results (and their practical consequences)
to geometric assumptions which are often implicitly made by random matrix theorists and
may not be relevant in data analytic practice.

1 Introduction

Modern multivariate statistics is increasingly high-dimensional. It is now easy to collect many samples
(n) with a large number of covariates (p) for each sample. In this paper, we will therefore study multivariate
statistical problems in the "large n, large p" setting that is increasingly popular in theoretical statistics.
By this we mean that we will study certain statistics in the asymptotic setting where n, the number of
observations, is going to infinity, and p, the number of predictors, is also going to infinity. Our focus will
be on the situation where p/n remains bounded.

The paper is mostly concerned with forms involving the inverse of a shrunken covariance matrix, or
powers of this inverse, as they play a key role in several important statistical problems that we review later
in this introduction. As a matter of fact, these objects, in one form or another, are central in many aspects
of classical regularized methods in statistics and other fields of applied mathematics. The purpose of this
paper is to explain how these regularized estimators behave in the "large p, large n" setting and derive
some understanding and insights about the behavior of widely used methods that rely on them.

In classical statistics, when $p \ll n$, one can get a good estimate of the spectral properties of $\Sigma$, the
population covariance matrix, by using its "naive" counterpart, the sample covariance matrix $\hat\Sigma$, with, if
$\hat\mu$ is the sample mean of our vectors,

$$\hat\Sigma = \frac{1}{n-1}\sum_{i=1}^n (X_i - \hat\mu)(X_i - \hat\mu)' .$$
As is now well-known, this is not the case when $p$ is comparable to $n$, which we denote by $p \asymp n$. In that
setting, even though the central limit theorem and a little bit of concentration of measure guarantee under
broad assumptions that

$$\max_{i,j} |\hat\Sigma(i,j) - \Sigma(i,j)| \to 0 ,$$

(even when $p \gg n$), the eigenvalues of $\hat\Sigma$ tend to be very different from those of $\Sigma$ (see Johnstone (2001)
or the reviews Johnstone (2007), El Karoui (2011)). Hence, it is important to understand the performance
of our standard techniques in this new asymptotic setting.

Recent papers concerned with these types of problems and their implications for concrete applications
are for instance El Karoui (2009b) and El Karoui (2009c), where the author showed that, somewhat surprisingly,
for a broad class of covariance matrices, means and distributions for the data, one could characterize
the performance of estimators as a function of the ratio p/n, and hence get consistent estimators for parameters,
such as the efficient frontier in classical portfolio theory, that appear difficult to estimate without
structural assumptions on the population parameters. In these papers, the regularization came under the
form of linear constraints on the vector of interest.

As mentioned before, shrinkage estimators of covariance are fundamental objects in modern statistics, 
partly because of James-Stein type phenomena (Haff (1980)) and they are very widely used. Here are a 
few examples. 

1. Classification (LDA, RDA): when we observe data coming from two Gaussian populations, with
different means $\mu_1$ and $\mu_2$, priors $\pi_1$ and $\pi_2$ but same covariance matrix $\Sigma$, the optimal classification
rule is known to be Fisher's linear discriminant analysis rule: classify an observation $x$ to class 2, if

$$x'\Sigma^{-1}(\mu_1 - \mu_2) > T(\mu_1, \mu_2, \Sigma, \pi_1, \pi_2) ,$$

where $T(\mu_1, \mu_2, \Sigma, \pi_1, \pi_2)$ is a known threshold. Naturally, we do not know $\Sigma$ in practice, so a natural
method is to replace it by $\hat\Sigma$. This is what is usually done in LDA (see Hastie et al. (2009)). In
Friedman (1989), concerned by, among other things, variance issues in LDA, Friedman proposed to
use RDA, regularized discriminant analysis, where instead of using $\hat\Sigma$ as an estimate of $\Sigma$, one uses
$\hat\Sigma + A$ or $(1 - \theta)\hat\Sigma + \theta A$, i.e. a shrinkage estimator of covariance. This estimator has also been proposed
by Ledoit and Wolf (2004) in another context. It is natural to ask what happens when using these
estimators in high-dimension.

2. Shrinkage estimators of covariance: a classic paper on the topic is Haff (1980); we also refer to
Anderson (2003), for explanations concerning the benefit of shrinkage. In portfolio optimization, at
least in the traditional mean-variance framework, similar issues arise. Hence partly motivated by
this problem, Ledoit and Wolf (2004) proposed to use a shrinkage estimator to solve the portfolio
optimization problem and get regularized solutions. In the finance literature, there are "finance-driven"
shrinkage estimators, like the one arising in the Black-Litterman model (see Meucci (2005)).

3. Regression problems: in ridge regression, where one seeks $\beta$ to minimize $\|Y - X\beta\|_2^2 + \lambda\beta'\Gamma\beta$, one also
encounters matrices of the form $\hat\Sigma + \lambda\Gamma$, which is a shrunken version of $\hat\Sigma$. The $\Gamma$ that is usually
taken is $\mathrm{Id}$; this regularization amounts to modifying the eigenvalues of $\hat\Sigma$.

In the analysis of all these methods, one needs to understand the behavior of the matrix $(\hat\Sigma + A)^{-1}$
(entrywise and/or globally) as well as similar quantities involving $(\hat\Sigma + A)^{-1}\Sigma_0(\hat\Sigma + A)^{-1}$ (where $\Sigma_0$ is
positive semidefinite) and this will be one of the focuses of the paper. It is tantalizing to use random matrix
theory to do so, a program we got started on in El Karoui (2009b) and El Karoui (2009c). However, as
documented in these papers, random matrix theory has several potential pitfalls: standard random matrix
models, though in appearance general, put implicitly very strong geometric constraints on the datasets
they are supposed to model. In light of this, one might be wary that the remarkable results that come out
of it are just consequences of this geometry, which may or may not be similar to the one a practitioner
encounters in practice. Hence we feel that any analysis that is not doing a meaningful robustness analysis
is sorely lacking.
As we have documented before, the geometric constraints put by classical random matrix theory on the
datasets modeled by it are due to manifestations of the concentration of measure phenomenon. Hence, it
seems to us that a good starting point for the analysis of shrunken covariance matrices and their applications
is that of generalized elliptical distributions, where the data is modeled as

$$\tilde X_i = \mu + R_i X_i ,$$

where $R_i$ is a random variable independent of $X_i$ and $X_i$ has some (mild) concentration properties. (This
will be made clear and precise later.)

The advantage of this class of models is that it contains the Gaussian model that is popular with
many researchers, though now understood to be lacking in many fundamental ways. When $\mathbf{E}(R_i^2) = 1$,
then $\mathrm{cov}(\tilde X_i) = \mathrm{cov}(X_i)$, so we can study robustness of our results in this class, since all the population
parameters (which will depend on covariance and mean) will be the same.

However, by studying the model at this level of generality, we will not be able to rely on various invariance
properties of the Gaussian distribution, and hence will really use only the geometric/concentration
properties of the random variables of interest. One advantage of such an approach is that these properties
are somewhat checkable in practice, through simple histograms for e.g. norms and scalar products of points
in the dataset, as has been explained before in some of the works cited above. Crucially, by showing that
the results depend on the properties of the $R_i$'s, we will be able to show that even in our simple setting the
geometry is key (change in the $R_i$'s may mean change in the geometry) and a major contributing factor in the
robustness of the results. Finally, it should be noted (see El Karoui (2009b)) that one can sometimes study
the bootstrap properties of various estimators by studying the class of elliptical distributions. Hence our
analysis could be used to gain insight into bootstrap properties of various estimators.
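The "checkable in practice" remark can be illustrated with a small simulation. The sketch below is only an illustration under assumptions of our choosing: the $X_i$ are standard Gaussian, $\mu = 0$, and the $R_i^2$ alternate between the hypothetical values 0.25 and 1.75, so that their average is 1 and the covariance is unchanged. It compares the normalized squared norms $\|R_i X_i\|^2/p$ in the two settings.

```python
import math
import random

def squared_norms(n, p, radii, rng):
    """Return ||R_i X_i||^2 / p for X_i ~ N(0, Id_p), i = 1..n."""
    out = []
    for i in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        out.append(radii[i] ** 2 * sum(c * c for c in x) / p)
    return out

rng = random.Random(0)
n, p = 50, 400

# Gaussian case: R_i = 1 for all i.
gauss = squared_norms(n, p, [1.0] * n, rng)

# Elliptical case (illustrative choice): R_i^2 alternates between 0.25 and 1.75,
# so the average of R_i^2 is still 1.
radii = [math.sqrt(0.25) if i % 2 == 0 else math.sqrt(1.75) for i in range(n)]
ellip = squared_norms(n, p, radii, rng)

spread_gauss = max(gauss) - min(gauss)
spread_ellip = max(ellip) - min(ellip)
print(f"Gaussian:   min={min(gauss):.2f} max={max(gauss):.2f}")
print(f"Elliptical: min={min(ellip):.2f} max={max(ellip):.2f}")
```

In the Gaussian case the normalized squared norms cluster around 1, while in the elliptical case they cluster around the values of $R_i^2$: the covariance is the same but the geometry of the point cloud is not.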

The focus of our paper will mostly be on entrywise properties of $(\hat\Sigma + A)^{-1}$ or $(\hat\Sigma + A)^{-1}\Sigma_0(\hat\Sigma + A)^{-1}$
in the class of models we consider, which naturally appear in the study of the risk of certain procedures.
Quadratic forms involving the sample mean are also important in practice and will be studied. Random
matrix theory already handles well things like $\mathrm{trace}((\hat\Sigma + A)^{-1})$, and other questions concerning only
eigenvalues, so we will not spend too much time on this, though they are potentially important in the
study of the risk of various estimators.

Beside shedding light on central statistical questions in multivariate analysis, our analysis also proposes 
what we think is a good and generic technical framework for carrying them out: namely we will do our 
work through invariance principles and mild concentration work. We will show that the statistics we are 
considering are asymptotically non-random, by showing that they are concentrated around their mean. 
And then we will show that the mean is the "same" in a broad class of models by using techniques akin to 
the Lindeberg method. A main difficulty is then to compute the mean (in many problems it is much harder 
to compute the mean of a statistic than to show that e.g its variance goes to zero), but our analysis will 
show that it can be done for favorable distributions in the class considered, and the Gaussian distribution 
will then be heavily used. Importantly, our analysis is very general and shows robustness even in classes 
where we have not or cannot at this point compute a limit for the quantity of interest. 

We should also point out that our concentration requirements on $X_i$ have purposely been kept to a
minimum and hence our results extend way beyond the traditional "linear combination of i.i.d." framework
which has been popular in random matrix theory following the nice work of Bai and Silverstein (see e.g.
Silverstein (1995), Silverstein and Bai (1995)). In particular, we will be able to handle (multivariate)
log-normal distributions and other non-linear deformations of Gaussian random variables. Also, conditions
on i.i.d.-ness are essentially replaced by conditions on the mean and covariance of the random variables we
deal with, as well as a little bit of concentration for linear and quadratic forms involving them. Our aim
was also to show that these "universality" results could be obtained rather simply, so an effort has been
made to make the proofs as simple as possible. The paper is a bit long because we treat many cases in
detail and at what we think is the right level of generality.

Finally, it will be noted by researchers interested in probability that some of our results can be seen as 
strong versions of classic random matrix results: where classic results gave results about normalized traces 
of certain random matrices, we will be able to have statements valid for each element of the diagonal of 
the matrix of interest. 
In Section 2, we present some of our main technical results and heuristic justification for some of the
main results, which should be helpful for statisticians wanting to get a sense of where the results come from.
Section 3 contains most proofs and the core technical work. Section 4 discusses some potential applications
to statistics, where at this point our main results shed light on existing procedures and "what they really
do". We conclude in Section 5 and present a result of independent interest on Stieltjes transforms in the
Appendix.



2 Strategy and exposition of some results 

Our strategy is to make use of invariance principles and concentration inequalities throughout the
paper. Practically, this translates into showing that the statistics we care about are concentrated around
their means; that is the concentration part. In a second step, we show that this mean does not depend
on the distribution of the data, as long as certain moment conditions are satisfied. To do so, we employ
techniques very similar to the Lindeberg method (see Stroock (1993); let us note that it has been perhaps
"re-popularized" by the nice work of Chatterjee in this direction, e.g. Chatterjee (2005)).

Throughout the paper, we will focus on models of an elliptical type, namely we observe i.i.d. observations

$$\tilde X_i = \mu + R_i X_i ,$$

where the $X_i$'s are independent and independent of the $R_i$'s. The $R_i$'s are allowed to be dependent. Our efforts
will go into relaxing distributional assumptions on $X_i$, while assuming only two moments on $R_i$ - the
justification for these choices coming from applications discussed at the end of the paper. In particular,
this means that we will be able to handle data with relatively heavy tails.

A main tool in our work will be a simple extension of the Efron-Stein inequality - which will allow us to
characterize higher moments of the statistics we care about. This extension is likely known in martingale
theory but we present a proof in the appendix for the convenience of the reader. We delay its detailed
presentation to the proof section and start by highlighting some of our main results.

2.1 A generalized version of the Efron-Stein inequality 

We will make repeated use of the following lemma, which follows from Burkholder's inequality (see 
Burkholder (1973)). 

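Lemma. Suppose $W = h(X_1, \ldots, X_n)$, where the $X_i$'s are independent. We call $\mathcal{F}_j = \sigma(X_1, \ldots, X_j)$. We
also denote by $W_m$ a (measurable) function of $(X_1, \ldots, X_{m-1}, X_{m+1}, \ldots, X_n)$.
Then, we have, for a constant $c$ that depends only on $k$, and for $k \ge 2$,

$$\mathbf{E}\left(|W - \mathbf{E}(W)|^k\right) \le c\left[\mathbf{E}\left(\left[\sum_{m=1}^n \mathbf{E}\left((W - W_m)^2 | \mathcal{F}_{m-1}\right)\right]^{k/2}\right) + \sum_{m=1}^n \mathbf{E}\left(|W - W_m|^k\right)\right] . \qquad (1)$$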

The classic Efron-Stein inequality corresponds to the case where $k = 2$. The advantage of using higher
$k$'s is that it will for instance allow us to control $\max_{j \in J} |W_j - \mathbf{E}(W_j)|$ for $J$'s of higher cardinalities. For
instance, if we can show that $\mathbf{E}(|W_j - \mathbf{E}(W_j)|^k) \le C n^{-k/2}$ for a certain $k$, a simple union bound (combined
with Markov's inequality) gives us

$$P\left(\max_{j \in J} |W_j - \mathbf{E}(W_j)| > t\right) \le \frac{|J|\, C\, n^{-k/2}}{t^k} .$$

Hence a bound valid for $k > 2$ will allow us to handle greater $J$'s. A number of applications (involving for
instance thresholding) also require control of higher moments, which will be provided by our methods.
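To make the role of $k$ concrete, here is a small numerical sketch of the union bound. The values of $n$, $t$ and $C$ are hypothetical, and using the same constant $C$ for every $k$ is a simplifying assumption of the example (in general the constant depends on $k$).

```python
def union_bound(card_J, C, n, k, t):
    """Markov + union bound: P(max_j |W_j - E W_j| > t) <= |J| * C * n^(-k/2) / t^k."""
    return card_J * C * n ** (-k / 2) / t ** k

n, t, C = 10_000, 0.5, 1.0

# Index set J of cardinality n (for instance, all diagonal entries when p is of order n).
# With k = 2 the bound is vacuous (larger than 1); with k = 6 it is tiny.
b2 = union_bound(card_J=n, C=C, n=n, k=2, t=t)
b6 = union_bound(card_J=n, C=C, n=n, k=6, t=t)
print(b2, b6)
```

With second moments only, the bound is useless as soon as $|J|$ is of order $n t^2$, whereas a sixth-moment bound tolerates index sets of polynomial size in $n$.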

We also note that we purposely tried to avoid deriving central limit theorems. While those are definitely 
interesting, we wanted to have finite sample bounds and have them be relatively robust with respect to 
distributional assumptions, in keeping with what we view as their potential practical usefulness. 
2.2 Quadratic forms in inverse of shrunken sample covariance matrices are essentially deterministic

We now state an application of the previous Lemma to forms which are at the center of our study. 

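Theorem. Suppose $X_1, \ldots, X_n \in \mathbb{R}^p$ are independent. Suppose further that $\mathbf{E}(X_i) = 0$ and, if $v$ is such
that $\|v\| = 1$, $\mathbf{E}(|X_i'v|^k) \le b_L(k; X_i)$, where $b_L(k; X_i)$ is a deterministic function depending only on the
distribution of $X_i$ and $k$. Call

$$S = \frac{1}{n}\sum_{i=1}^n R_i^2 X_i X_i' ,$$

where the $R_i$ are deterministic.

Call $\mathcal{S}(t) = S + A$, and assume that for some $t > 0$, $A$ is positive definite, with $A \succeq t\,\mathrm{Id}_p$. Then, if
$\|x\| = 1$, the centered moment

$$\mathbf{E}\left(\left|x'[\mathcal{S}(t)]^{-1}x - \mathbf{E}\left(x'[\mathcal{S}(t)]^{-1}x\right)\right|^k\right)$$

is bounded by an explicit quantity [the right-hand side is illegible in the source; its surviving fragments
involve $b_L(4; X_i)$, $b_L(2k; X_i)$, the $R_i$'s, $t$ and powers of $n$].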

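It is perhaps instructive to give an example at this point. Here are two.

• Suppose that $X_i$ satisfies $P(|X_i'v| > t) \le C\exp(-ct^b)$, and $X_i$ has mean 0. Then

$$b_L(k; X_i) \le \frac{Ck}{b\,c^{k/b}}\,\Gamma\left(\frac{k}{b}\right) .$$

• Suppose that $X_i$ satisfies $P(|X_i'v| > t) \le Ct^{-b}$. Then if $b > (k + 1)$, $b_L(k; X_i)$ admits an explicit
bound [illegible in the source].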



We note that the condition on the $X_i$'s is rather minimal: all we need is some concentration of linear
forms in $X_i$, something that might seem surprising at first.

The exponential deviation inequality in our first example might look like a strong assumption. However,
it is satisfied by many distributions, with quite non-linear structures which would be difficult to analyze
if one did not resort to concentration of measure statements. The (centered) Gaussian copula is a good
example. We give specific examples in Subsubsection 3.2.1.

The result also gives us a reasonable understanding of the size of the fluctuations of the
quadratic forms we are interested in. Note that using the Gaussian case (at $t = 0$) as a comparison, the
fluctuation size of $n^{-1/2}$ seems to be the right one.
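One way to see where a bound of order $n^{-k/2}$ can come from is to plug a leave-one-out statistic into the Lemma of Subsection 2.1. The following is only a sketch under assumptions of ours: the choice of $W_m$, the auxiliary matrix $S_m$ and the resulting constants are illustrative and are not claimed to match the exact statement of the theorem.

```latex
% Sketch: W = x'(S+A)^{-1}x, and W_m is the same form with the m-th term removed,
%   S_m = S - \frac{1}{n} R_m^2 X_m X_m' , \qquad W_m = x'(S_m + A)^{-1}x .
% The rank-one update formula gives
\begin{align*}
0 \le W_m - W
  &= \frac{\frac{1}{n} R_m^2 \left(x'(S_m+A)^{-1}X_m\right)^2}
          {1 + \frac{1}{n} R_m^2 X_m'(S_m+A)^{-1}X_m}
   \le \frac{R_m^2}{n}\left(x'(S_m+A)^{-1}X_m\right)^2 .
\end{align*}
% Since \|(S_m+A)^{-1}x\| \le 1/t and X_m is independent of S_m, conditioning on S_m yields
\begin{align*}
\mathbf{E}\left(|W - W_m|^k\right) &\le \frac{R_m^{2k}\, b_L(2k; X_m)}{n^k t^{2k}} , &
\mathbf{E}\left((W - W_m)^2 \mid \mathcal{F}_{m-1}\right) &\le \frac{R_m^{4}\, b_L(4; X_m)}{n^2 t^{4}} .
\end{align*}
% Inserting these two estimates in inequality (1):
\begin{align*}
\mathbf{E}\left(|W - \mathbf{E}(W)|^k\right)
  \le \frac{c}{t^{2k}}\left[ n^{-k/2}\left(\frac{1}{n}\sum_{m=1}^n R_m^4\, b_L(4; X_m)\right)^{k/2}
      + n^{1-k}\,\frac{1}{n}\sum_{m=1}^n R_m^{2k}\, b_L(2k; X_m) \right] .
\end{align*}
```

When the averages of $R_m^4 b_L(4; X_m)$ and $R_m^{2k} b_L(2k; X_m)$ stay bounded, the first term dominates and the $k$-th centered moment is of order $n^{-k/2}$, in line with the fluctuation size discussed above.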



General strategy. The general strategy is now clear. In light of the previous theorem, if we can get a
good deterministic approximation to $\mathbf{E}(\mathcal{S}(t)^{-1})$, we will be able to get an approximation of $x'\mathcal{S}(t)^{-1}x$.
Note that the considerable simplification here is that we are not dealing with random variables anymore.
Fortunately, we can approximate this expectation using variants of methods that have been developed in
the random matrix literature (specifically the part of the theory concerned with understanding limiting
spectral distributions). Also, it will be possible to show that these expectations do not vary much when
we change some details of the distributions - this is the essence of Lindeberg-style ideas. Hence, all we will
have to do is show that the expectations in question do not change much when we replace the $X_i$'s by $Y_i$'s
with a different distribution (but the same covariance and mean), and then compute the expectation in a
favorable case, for instance when the $X_i$'s are Gaussian.
2.3 Heuristics

To help readers unfamiliar with random matrix theory understand better the results, we now present
heuristics that help us guess the results. Formal proofs essentially start from these conjectures and proceed
to verify that they are indeed correct.

We will focus on two types of quantities:

$$v'(S + A)^{-1}v \quad \text{and} \quad v'(S + A)^{-1}B(S + A)^{-1}v ,$$

where $A$ and $B$ are positive definite matrices.

Also, $S = \frac{1}{n}\sum_{i=1}^n R_i^2 X_i X_i'$ where $X_i = \Sigma^{1/2}Y_i$, where $Y_i$ has covariance $\mathrm{Id}_p$, and $X_i$ (or $Y_i$) satisfies
mild concentration inequalities - the details are given when we undertake a rigorous proof. At this point,
the reader can safely assume that $X_i$ is $\mathcal{N}(0, \Sigma)$ (so $Y_i$ is $\mathcal{N}(0, \mathrm{Id}_p)$). In other words, $S$ is the "sample"
covariance matrix we would use if we knew the mean of the data.

We have the following heuristic result:

Heuristic 2.1. Under regularity conditions, we have

$$v'(S + A)^{-1}v \simeq v'(\gamma(A)\Sigma + A)^{-1}v ,$$

where if

$$a(A) = \frac{1}{n}\mathrm{trace}\left(\Sigma(S + A)^{-1}\right) ,$$

$a(A)$ has an asymptotically deterministic equivalent and

$$\gamma(A) = \frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{1 + R_i^2 a(A)} .$$



Argument: The key element of this argument is really the concentration of quadratic forms in $Y_i$, which
allows us to replace quantities of the type $Y_i'MY_i/p$ by $\mathrm{trace}(M)/p = \mathbf{E}(Y_i'MY_i)/p$.

The fact that $\frac{1}{n}\mathrm{trace}(\Sigma(S + A)^{-1})$ has an asymptotically deterministic equivalent comes from standard
arguments in random matrix theory (for some that rely on concentration and are just a few lines, see El
Karoui (2009a); see also Subsection 3.5). Let us write $S = \sum_{i=1}^n r_i r_i'$, where the $r_i = R_i X_i/\sqrt{n}$ are independent. Now, we
have (using an idea akin to some in Silverstein (1995) and now classic in random matrix theory)

$$S(S + A)^{-1} = \mathrm{Id} - A(S + A)^{-1} ,$$

and hence, using the fact that $(r_i r_i' + M_i)^{-1} = M_i^{-1} - \frac{M_i^{-1} r_i r_i' M_i^{-1}}{1 + r_i' M_i^{-1} r_i}$,

$$S(S + A)^{-1} = \sum_{i=1}^n \frac{r_i r_i' M_i^{-1}}{1 + r_i' M_i^{-1} r_i} ,$$

where $M_i = S + A - r_i r_i'$.

Therefore, if $v$ and $u$ are two vectors,

$$v'A(S + A)^{-1}u = v'u - \sum_{i=1}^n \frac{v' r_i r_i' M_i^{-1} u}{1 + r_i' M_i^{-1} r_i} .$$

Now because $Y_i$ satisfies a dimension-free concentration inequality, we have, if $M$ is a matrix independent
of $Y_i$, $Y_i'MY_i/p \simeq \mathrm{trace}(M)/p$. Applying this heuristic in each term of the previous sum, we get,

$$v'A(S + A)^{-1}u \simeq v'u - \frac{1}{n}\sum_{i=1}^n \frac{R_i^2\, v'\Sigma M_i^{-1}u}{1 + R_i^2 \frac{1}{n}\mathrm{trace}\left(\Sigma M_i^{-1}\right)} .$$
Now not much is lost by replacing $M_i$ by $S + A$ everywhere in the previous expression. Hence, we have
heuristically,

$$v'A(S + A)^{-1}u \simeq v'u - \left[\frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{1 + R_i^2 \frac{1}{n}\mathrm{trace}\left(\Sigma(S + A)^{-1}\right)}\right] v'\Sigma(S + A)^{-1}u
= v'u - \gamma(A)\, v'\Sigma(S + A)^{-1}u .$$



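Another way of rewriting this equation is simply

$$v'(S + A)^{-1}u \simeq v'A^{-1}u - \gamma(A)\, v'A^{-1}\Sigma(S + A)^{-1}u .$$

Now, let us call $v_k = (\Sigma A^{-1})^k v$, so that $v_k' = v'(A^{-1}\Sigma)^k$ and $v_k'A^{-1}\Sigma = v_{k+1}'$. Applying the previous
heuristic with $v_k$ in place of $v$ and with $u = v$, we have, if $\beta_k = v_k'(S + A)^{-1}v$ and $\alpha_k = v_k'A^{-1}v$,

$$\beta_k \simeq \alpha_k - \gamma(A)\beta_{k+1} .$$

Assuming that we can use the previous approximation many times, we get, for any integer $N$,

$$\beta_0 \simeq \sum_{j=0}^N (-\gamma(A))^j \alpha_j + (-\gamma(A))^{N+1}\beta_{N+1} .$$

Now assuming that we can sum the series and that $(\gamma(A))^{N+1}\beta_{N+1} \to 0$, we get

$$\beta_0 \simeq \sum_{j=0}^\infty (-\gamma(A))^j \alpha_j = v'\left[\sum_{j=0}^\infty \left(-\gamma(A)A^{-1}\Sigma\right)^j\right]A^{-1}v
= v'\left(\mathrm{Id} + \gamma(A)A^{-1}\Sigma\right)^{-1}A^{-1}v = v'(A + \gamma(A)\Sigma)^{-1}v .$$

Note that $\beta_0 = v'(S + A)^{-1}v$. Hence, it is perhaps reasonable to conjecture that

$$v'(S + A)^{-1}v \simeq v'(A + \gamma(A)\Sigma)^{-1}v .$$

Note that the heuristic also gives us conjectures for approximating the value of $v'(A^{-1}\Sigma)^k(S + A)^{-1}v$, for
any given $k$, as this is what we called earlier $\beta_k$. □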
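Heuristic 2.1 can be probed numerically in a small example. The sketch below is an illustration under assumptions of ours: $\Sigma = \mathrm{Id}_p$, $A = t\,\mathrm{Id}_p$, Gaussian $X_i$, $v$ the first basis vector, and hypothetical values of $n$, $p$, $t$ and of the $R_i$'s. It uses only the standard library, with a hand-written Gauss-Jordan inverse, and plugs the empirical $a(A)$ into the formula for $\gamma(A)$.

```python
import random

def inverse(mat):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(mat)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [x / d for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def check_heuristic(n, p, t, radii, rng):
    """Return (v'(S+A)^{-1}v, v'(gamma*Sigma + A)^{-1}v) for Sigma = Id, A = t*Id."""
    # S = (1/n) sum_i R_i^2 X_i X_i' with X_i ~ N(0, Id_p).
    S = [[0.0] * p for _ in range(p)]
    for i in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        w = radii[i] ** 2 / n
        for a in range(p):
            wa = w * x[a]
            row = S[a]
            for b in range(p):
                row[b] += wa * x[b]
    for a in range(p):
        S[a][a] += t                                   # S + A with A = t * Id
    inv = inverse(S)
    lhs = inv[0][0]                                    # v = first basis vector
    a_hat = sum(inv[j][j] for j in range(p)) / n       # (1/n) trace(Sigma (S+A)^{-1})
    gamma = sum(r * r / (1.0 + r * r * a_hat) for r in radii) / n
    rhs = 1.0 / (gamma + t)                            # unit v, Sigma = Id, A = t * Id
    return lhs, rhs

rng = random.Random(1)
n, p, t = 80, 40, 1.0
lhs_gauss, rhs_gauss = check_heuristic(n, p, t, [1.0] * n, rng)
# Same covariance (average of R_i^2 equal to 1) but a different geometry.
radii = [0.5 if i % 2 == 0 else 1.75 ** 0.5 for i in range(n)]
lhs_ellip, rhs_ellip = check_heuristic(n, p, t, radii, rng)
print(lhs_gauss, rhs_gauss)
print(lhs_ellip, rhs_ellip)
```

At these small sizes the two sides agree only up to fluctuations of order $n^{-1/2}$; increasing $n$ and $p$ proportionally should tighten the match, and changing the $R_i$'s changes $\gamma(A)$ even though the covariance is unchanged.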

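For dealing with higher powers of $(S + A)^{-1}$, we also need the following heuristic.

Heuristic 2.2. Under regularity assumptions, we have

$$v'(S + A)^{-1}B(S + A)^{-1}v \simeq v'(A + \gamma(A)\Sigma)^{-1}\left(B + \xi(A, B)\Sigma\right)(A + \gamma(A)\Sigma)^{-1}v ,$$

where $\gamma(A)$ is defined in Heuristic 2.1 and

$$\xi(A, B) = \left[\frac{1}{n}\sum_{i=1}^n \frac{R_i^4}{(1 + R_i^2 a(A))^2}\right] \frac{1}{n}\mathrm{trace}\left(\Sigma(S + A)^{-1}B(S + A)^{-1}\right) .$$

Furthermore, $\xi(A, B)$ has an asymptotically deterministic equivalent.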

Argument: Let us call $f(t) = v'(S + A(t))^{-1}v$. Then, since $\left([M(t)]^{-1}\right)' = -[M(t)]^{-1}M'(t)[M(t)]^{-1}$, we
have

$$f'(t) = -v'(S + A(t))^{-1}A'(t)(S + A(t))^{-1}v .$$

Now, if we consider $A(t) = A + tB$, we see that $A'(t) = B$, and therefore,

$$f'(0) = -v'(S + A)^{-1}B(S + A)^{-1}v ,$$

which is, up to sign, the quantity we seek to approximate.
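Now recall that from Heuristic 2.1, we gathered that

$$v'(S + A)^{-1}v \simeq v'(A + \gamma(A)\Sigma)^{-1}v .$$

We might be tempted to look at this approximate equality as valid for any $A(t)$ and take the derivative
with respect to $t$. Doing so, we would get, if $g(t) = v'(A(t) + \gamma(A(t))\Sigma)^{-1}v$,

$$g'(0) = -v'(A + \gamma(A)\Sigma)^{-1}\left(B + \left[\gamma(A(t))\right]'\big|_{t=0}\,\Sigma\right)(A + \gamma(A)\Sigma)^{-1}v .$$

Now,

$$\gamma(A(t)) = \frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{1 + R_i^2 a(A(t))} .$$

Hence, if $h(t) = \gamma(A(t))$ and $k(t) = a(A(t)) = \frac{1}{n}\mathrm{trace}\left(\Sigma(S + A(t))^{-1}\right)$, we have

$$h'(0) = -k'(0)\,\frac{1}{n}\sum_{i=1}^n \frac{R_i^4}{(1 + R_i^2 a(A))^2} .$$

Now, $k'(t) = -\frac{1}{n}\mathrm{trace}\left(\Sigma(S + A(t))^{-1}B(S + A(t))^{-1}\right)$. Hence,

$$k'(0) = -\frac{1}{n}\mathrm{trace}\left(\Sigma(S + A)^{-1}B(S + A)^{-1}\right) ,$$

and we conclude that

$$h'(0) = \left[\frac{1}{n}\mathrm{trace}\left(\Sigma(S + A)^{-1}B(S + A)^{-1}\right)\right] \frac{1}{n}\sum_{i=1}^n \frac{R_i^4}{(1 + R_i^2 a(A))^2} = \xi(A, B) .$$

The fact that $\xi(A, B)$ is asymptotically non-random comes from the same ideas as described in Heuristic
2.1. □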
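The differentiation step used in the argument, $f'(0) = -v'(S + A)^{-1}B(S + A)^{-1}v$ for $f(t) = v'(S + A + tB)^{-1}v$, is easy to check by finite differences. The sketch below uses arbitrary $2 \times 2$ matrices chosen for illustration only.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def quad(v, m, w):
    """Bilinear form v' m w for 2-vectors and a 2x2 matrix."""
    return sum(v[i] * m[i][j] * w[j] for i in range(2) for j in range(2))

def add(m1, m2, scale=1.0):
    """Return m1 + scale * m2 for 2x2 matrices."""
    return tuple(tuple(m1[i][j] + scale * m2[i][j] for j in range(2)) for i in range(2))

S = ((1.0, 0.3), (0.3, 0.5))     # a fixed positive definite "sample covariance"
A = ((2.0, 0.0), (0.0, 1.0))     # shrinkage target
B = ((1.0, 0.2), (0.2, 3.0))     # direction of perturbation, A(t) = A + t B
v = (0.6, 0.8)

def f(t):
    # f(t) = v'(S + A + tB)^{-1} v
    return quad(v, inv2(add(add(S, A), B, t)), v)

h = 1e-6
numeric = (f(h) - f(-h)) / (2 * h)       # central finite difference at t = 0

M_inv = inv2(add(S, A))
# Closed form: f'(0) = -v' M^{-1} B M^{-1} v with M = S + A (M is symmetric).
Mv = tuple(sum(M_inv[i][j] * v[j] for j in range(2)) for i in range(2))
closed_form = -quad(Mv, B, Mv)
print(numeric, closed_form)
```

The same identity applied to $g(t)$, with $A(t) + \gamma(A(t))\Sigma$ in place of $S + A(t)$, is what produces the extra term $\xi(A, B)\Sigma$.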

In our applications, we will also need to understand quantities of the type $\hat\mu'(\widetilde{S} + A)^{-1}\hat\mu$ (where $\widetilde{S} =
S - \hat\mu\hat\mu'$) and $v'(\widetilde{S} + A)^{-1}v$. We naturally treat those cases below and refer the reader to that part of
the paper for information about these forms. The main issue is that when dealing with $S$ and $\hat\mu$, a non-negligible
interaction term between the two occurs (it is related to $\hat\mu'(S + A)^{-1}\hat\mu$) and one needs to be a
bit careful to treat it.



3 Results and proofs 

This section contains the main technical aspects of the paper. In Subsection 3.1, we discuss a simple
extension of the Efron-Stein inequality. The rest of this section is devoted to showing concentration and
invariance of the forms we care about. The method of proof is systematic: we first show concentration (i.e.
control of the variance or higher moments), and then show that the mean value to which we can reduce the
problem does not depend on "details" of the distribution of the data through a Lindeberg-like argument.

Notations. Before we proceed, let us set some notations. We denote by $|||M|||_2$ the operator norm
(i.e. largest singular value) of a matrix $M$. When dealing with several independent random variables
$(X_1, \ldots, X_n)$, we use $\mathbf{E}_i(\cdot)$ to denote expectation with respect to $X_i$ only. We often use the abbreviation
psd for positive semi-definite.
3.1 A simple extension of the Efron-Stein inequality

The strategy for our approach is to first show that the quadratic forms we care about, namely

$$v'(S + A)^{-1}v , \quad A \succeq t\,\mathrm{Id}_p ,$$

(and variants) are essentially deterministic asymptotically. Modern techniques can be adapted to then get
(in simple cases compared to the generality level at which we will work) deterministic approximations of
$v'(S + A)^{-1}v$ and we can then use those to actually compute the limit of the aforementioned quadratic
form. But it is important to get a systematic way of showing that for a certain class of random matrices
$S$,

$$v'(S + A)^{-1}v \simeq v'\,\mathbf{E}\left((S + A)^{-1}\right)v .$$

To do so, we propose to use (essentially) a martingale difference argument, which is not unknown in
random matrix theory (Bai (1999), Girko (1990), and several others), but whose role may not have been
as emphasized as it perhaps should have been. However, at the level of generality at which we are working, our
proofs become easier if we quickly branch away from standard methods. The following lemma is essentially
a variant of the Efron-Stein inequality (see Efron and Stein (1981), Theorem 2, and also Lugosi (2006),
Theorem 9). It is surely known in martingale theory but we give a simple proof here for the convenience
of the reader.

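Lemma 3.1. Suppose $W = h(X_1, \ldots, X_n)$, where the $X_i$'s are independent. We call $\mathcal{F}_j = \sigma(X_1, \ldots, X_j)$.
We also denote by $W_m$ a (measurable) function of $(X_1, \ldots, X_{m-1}, X_{m+1}, \ldots, X_n)$.
Then, we have, for a constant $c$ that depends only on $k$, and for $k \ge 2$,

$$\mathbf{E}\left(|W - \mathbf{E}(W)|^k\right) \le c\left[\mathbf{E}\left(\left[\sum_{m=1}^n \mathbf{E}\left((W - W_m)^2 | \mathcal{F}_{m-1}\right)\right]^{k/2}\right) + \sum_{m=1}^n \mathbf{E}\left(|W - W_m|^k\right)\right] . \qquad (2)$$

Note that in the case $k = 2$, we recover the Efron-Stein inequality

$$\mathrm{var}(W) \le \sum_{m=1}^n \mathbf{E}\left((W - W_m)^2\right) ,$$

with a possibly worse constant.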
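As a sanity check on the case $k = 2$, the sample mean gives a worked example in which the classical Efron-Stein bound (with constant 1) is attained. The choice of $W_m$ below is ours and only serves as an illustration.

```latex
% Worked example: W is the sample mean of i.i.d. X_1, ..., X_n with mean \mu and variance \sigma^2.
\begin{align*}
W &= \frac{1}{n}\sum_{i=1}^n X_i , &
W_m &= \frac{1}{n}\Big(\sum_{i \neq m} X_i + \mu\Big) \quad \text{(does not involve } X_m\text{)} , \\
W - W_m &= \frac{X_m - \mu}{n} , &
\sum_{m=1}^n \mathbf{E}\left((W - W_m)^2\right) &= n \cdot \frac{\sigma^2}{n^2} = \frac{\sigma^2}{n} = \mathrm{var}(W) .
\end{align*}
```

So for linear statistics nothing is lost in the variance bound; the interest of the lemma lies in non-linear statistics such as $v'(S + A)^{-1}v$ and in moments of order $k > 2$.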

In the applications we have in mind, through rank-1 update of inverses of matrices, we will easily get
an approximation of $W$ by a function that does not involve the m-th variable and these results will come
in particularly handy.
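The rank-1 update referred to here is the identity $(M + rr')^{-1} = M^{-1} - \frac{M^{-1}rr'M^{-1}}{1 + r'M^{-1}r}$. A minimal numerical check, with a diagonal $M$ and a vector $r$ chosen arbitrarily for illustration:

```python
def sherman_morrison_inverse(d, r):
    """Inverse of diag(d) + r r' via the rank-one update formula.

    Uses (M + r r')^{-1} = M^{-1} - M^{-1} r r' M^{-1} / (1 + r' M^{-1} r)
    with M = diag(d) and all d_i > 0.
    """
    p = len(d)
    u = [r[i] / d[i] for i in range(p)]                 # u = M^{-1} r
    denom = 1.0 + sum(r[i] * u[i] for i in range(p))    # 1 + r' M^{-1} r
    return [[(1.0 / d[i] if i == j else 0.0) - u[i] * u[j] / denom
             for j in range(p)] for i in range(p)]

d = [2.0, 1.0, 4.0]
r = [0.5, -1.0, 2.0]
inv = sherman_morrison_inverse(d, r)

# Multiply diag(d) + r r' by the candidate inverse; the result should be the identity.
full = [[(d[i] if i == j else 0.0) + r[i] * r[j] for j in range(3)] for i in range(3)]
prod = [[sum(full[i][k] * inv[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)]
max_err = max(abs(prod[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
print(max_err)
```

In the setting of the paper, $M$ plays the role of $S + A$ with the m-th observation removed, which is a function of all the variables except $X_m$.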

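Proof of Lemma 3.1. Throughout the proof, we write $Z$ for $W$ and $Z_m$ for $W_m$. We can clearly write
$Z - \mathbf{E}(Z)$ as a sum of martingale differences: if

$$V_m = \mathbf{E}(Z | \mathcal{F}_m) - \mathbf{E}(Z | \mathcal{F}_{m-1}) ,$$

$$Z - \mathbf{E}(Z) = \sum_{m=1}^n V_m .$$

Note also that if $Z_m$ is a (measurable) function of all the $X_i$'s except $X_m$,

$$V_m = \mathbf{E}(Z - Z_m | \mathcal{F}_m) - \mathbf{E}(Z - Z_m | \mathcal{F}_{m-1}) ,$$

since $\mathbf{E}(Z_m | \mathcal{F}_m) = \mathbf{E}(Z_m | \mathcal{F}_{m-1})$.

Now let us call $s(Z) = \left[\sum_{m=1}^n \mathbf{E}(V_m^2 | \mathcal{F}_{m-1})\right]^{1/2}$. Recall that Burkholder's inequality implies (see
Equation 21.5 in Burkholder (1973)) that, if $\Phi$ is a non-decreasing function on $[0, \infty]$ with $\Phi(0) = 0$ and
$\Phi(2\lambda) \le c_1\Phi(\lambda)$, then

$$\mathbf{E}\left(\Phi(|Z - \mathbf{E}(Z)|)\right) \le c\left[\mathbf{E}\left(\Phi(s(Z))\right) + \sum_{k=1}^n \mathbf{E}\left(\Phi(|V_k|)\right)\right] .$$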
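As noted in Burkholder (1973), $\Phi(x) = x^k$ satisfies the conditions needed for the inequality to hold. Let
us remind the reader that it is well known (see Lugosi (2006), p. 16) that

$$V_m^2 \le \mathbf{E}\left((Z - \mathbf{E}_m(Z))^2 | \mathcal{F}_m\right) ,$$

where $\mathbf{E}_m(\cdots)$ is expectation with respect to $X_m$ only, i.e. $\mathbf{E}_m(Z) = \mathbf{E}(Z | X_1, \ldots, X_{m-1}, X_{m+1}, \ldots, X_n)$.
Also, as noted for instance in Lugosi (2006),

$$\mathbf{E}_m\left((Z - \mathbf{E}_m(Z))^2\right) \le \mathbf{E}_m\left((Z - Z_m)^2\right) ,$$

where $Z_m$ is any measurable function of $X_1, \ldots, X_{m-1}, X_{m+1}, \ldots, X_n$. We note that

$$\mathbf{E}(\cdot | \mathcal{F}_{m-1}) = \mathbf{E}\left(\mathbf{E}_m(\cdot) | \mathcal{F}_{m-1}\right) .$$

Therefore,

$$\mathbf{E}\left(V_m^2 | \mathcal{F}_{m-1}\right) \le \mathbf{E}\left([Z - \mathbf{E}_m(Z)]^2 | \mathcal{F}_{m-1}\right) = \mathbf{E}\left(\mathbf{E}_m\left([Z - \mathbf{E}_m(Z)]^2\right) | \mathcal{F}_{m-1}\right) \le \mathbf{E}\left((Z - Z_m)^2 | \mathcal{F}_{m-1}\right) ,$$

and we have

$$s(Z) \le \sqrt{\sum_{m=1}^n \mathbf{E}\left((Z - Z_m)^2 | \mathcal{F}_{m-1}\right)} .$$

Hence, because $\Phi$ is non-decreasing,

$$\mathbf{E}\left(\Phi(s(Z))\right) \le \mathbf{E}\left(\Phi\left(\sqrt{\sum_{m=1}^n \mathbf{E}\left((Z - Z_m)^2 | \mathcal{F}_{m-1}\right)}\right)\right) .$$

Now let us turn our attention to $\mathbf{E}(\Phi(|V_m|))$, specifically when $\Phi(x) = x^k$. Since $V_m = \mathbf{E}(Z - Z_m | \mathcal{F}_m) -
\mathbf{E}(Z - Z_m | \mathcal{F}_{m-1})$,

$$|V_m|^k \le 2^{k-1}\left(|\mathbf{E}(Z - Z_m | \mathcal{F}_m)|^k + |\mathbf{E}(Z - Z_m | \mathcal{F}_{m-1})|^k\right) .$$

Also, when $k \ge 1$, $|x|^k$ is convex, so Jensen's inequality implies that

$$|\mathbf{E}(Z - Z_m | \mathcal{F}_m)|^k \le \mathbf{E}\left(|Z - Z_m|^k | \mathcal{F}_m\right) .$$

Therefore,

$$\mathbf{E}\left(|V_m|^k\right) \le 2^k\,\mathbf{E}\left(|Z - Z_m|^k\right) .$$

Equation (2) now follows easily. □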



We note that if we were willing to make stronger assumptions on the data than the ones we will make,
we could rely on other concentration inequalities to obtain for instance Gaussian concentration for some of
the statistics we are interested in. However, since our study is a robustness study, we made the choice of
making weaker assumptions and consequently to have possibly worse concentration inequalities - though
of course this allows us to show that our first order results hold for a wider class of distributions.

3.2 Setup of our study 

In all that follows we make the following assumptions, which we will casually call "our usual assumptions".

• We assume that $p/n$ remains bounded away from 0 and $\infty$, i.e. $p \asymp n$.

• The random variables $X_i$ and $Y_i$ which will appear below have the same covariance matrix, $\Sigma_i$, and
same mean, 0.

• The $Y_i$'s are independent and so are the $X_i$'s.

• The $Y_i$'s are independent of the $X_i$'s.

• If $v$ is any fixed vector with norm 1, we have, for $k \ge 1$,

$$\mathbf{E}\left(|X_i'v|^k\right) \le b_L(k; X_i) . \qquad (3)$$

• If $M$ is any deterministic and positive semidefinite matrix with $|||M|||_2 \le 1$,

$$\mathbf{E}\left(|X_i'MX_i - \mathbf{E}(X_i'MX_i)|^k\right) \le b_{Q_2}(k; X_i) . \qquad (4)$$

• The matrix towards which we shrink, $A$, is such that $A \succeq t\,\mathrm{Id}_p$.

Let us note that by Jensen's inequality, there is no loss in generality in assuming that $b_L(k; X_i) \le
\sqrt{b_L(2k; X_i)}$. We will assume this throughout this paper, as this will occasionally be needed to merge
certain bounds arising in our estimates, and thus to shorten our formulas.
Also we note that if $A \succeq t\,\mathrm{Id}$ and $\Sigma_0 \succeq 0$, for any $x \in \mathbb{R}^p$, we have

$$x'(A + \Sigma_0)^{-2}x \le \frac{1}{t}\,x'A^{-1}x ,$$

which is easily seen since $M \mapsto M^{-1}$ is monotone (and decreasing with respect to the Loewner order),
so $(A + \Sigma_0)^{-1} \preceq t^{-1}\mathrm{Id}$; now multiplying on both sides by $(A + \Sigma_0)^{-1/2}$, the inequality (and its order) is
preserved and we conclude that $(A + \Sigma_0)^{-2} \preceq t^{-1}(A + \Sigma_0)^{-1} \preceq t^{-1}A^{-1}$.

Finally, let us give some order of magnitude bounds. $b_L$ will generally be very easy to control, as it
concerns a linear form in $X_i$. For instance, if $X_i \sim \mathcal{N}(0, \mathrm{Id}_p)$, we have $X_i'v \sim \mathcal{N}(0, \|v\|^2)$, so $b_L(k; X_i)$ is of order 1
for all (finite) $k$. When $X_i$ is $\mathcal{N}(0, \mathrm{Id}_p)$, $X_i'MX_i$ is a weighted $\chi^2$ since $X_i'MX_i = \sum_{k=1}^p g_k^2\lambda_k(M)$ where
the $g_k$ are $\mathcal{N}(0, 1)$ and independent. Hence, we conclude that $b_{Q_2}(k; X_i)$ is of order at most $p^{k/2}$ in this case.
The informal bounds we will have in mind are therefore

$$b_L(k; X_i) = O(1) , \qquad \frac{b_{Q_2}(k; X_i)}{p^{k/2}} = O(1) \quad \left(\text{and hence } \frac{b_{Q_2}(k; X_i)}{n^{k/2}} = O(1)\right) ,$$

where the last statement comes from the fact that $p \asymp n$.
We further note that if $\Sigma$ is a covariance matrix,

$$b_L(k; \Sigma^{1/2}X_i) \le |||\Sigma|||_2^{k/2}\, b_L(k; X_i) ,$$
$$b_{Q_2}(k; \Sigma^{1/2}X_i) \le |||\Sigma|||_2^{k}\, b_{Q_2}(k; X_i) .$$

To bound $b_{Q_2}$ in certain situations, it will be simpler to work through an auxiliary quantity, $b_{Q_1}$. Let
us define it as follows: if $M$ is any deterministic (psd) matrix with $|||M|||_2 \le 1$,

$$\mathbf{E}\left(\left|\sqrt{Y_i'MY_i} - \mathbf{E}\left(\sqrt{Y_i'MY_i}\right)\right|^k\right) \le b_{Q_1}(k; Y_i) .$$

Connection between $b_{Q_1}$ and $b_{Q_2}$. $b_{Q_1}$ and $b_{Q_2}$ are of course very closely related. Also, in a concentration
context, because $y \mapsto \sqrt{y'My}$ is Lipschitz with respect to Euclidean norm and convex, it is possible
to derive $b_{Q_1}$ for many distributions for which it would be otherwise difficult. For instance Gaussian
concentration immediately implies deviation bounds and hence bounds on $b_{Q_1}$ for e.g. centered Gaussian
copulas.

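Let us now elaborate on the relationship between $b_{Q_1}$ and $b_{Q_2}$. Let us call $Q_M(Y) = Y'MY$, $q_M(Y) =
\sqrt{Q_M(Y)}$, $\Delta_M(Y) = Q_M(Y) - \mathbf{E}(Q_M(Y))$ and $\delta_M(Y) = \sqrt{Q_M(Y)} - \mathbf{E}\left(\sqrt{Q_M(Y)}\right)$, i.e. $\delta_M(Y) = q_M(Y) -
\mathbf{E}(q_M(Y))$. Clearly,

$$\Delta_M(Y) = \left(q_M^2(Y) - [\mathbf{E}(q_M(Y))]^2\right) + [\mathbf{E}(q_M(Y))]^2 - \mathbf{E}(Q_M(Y))$$
$$= \delta_M(Y)\left[\delta_M(Y) + 2\mathbf{E}(q_M(Y))\right] + [\mathbf{E}(q_M(Y))]^2 - \mathbf{E}(Q_M(Y))$$
$$= \delta_M(Y)\left[\delta_M(Y) + 2\mathbf{E}(q_M(Y))\right] - \mathrm{var}(q_M(Y)) .$$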
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Using convexity of x ^ {x]'^ , we conclude that 

'\5m{Y)\'' + 2''\6M{Yt[E {QM{Y))f/' + [var {qM{y))]' 



\AM{Yt<3''-' 



< 3 



,k-l 



Now note that E (Qm(^)) = trace (MS) and that var {qiviiv)) = ^Qi(2; Y). So after taking expectations, 
we have shown that 



bQ,{k-,Y) < 3'-' [bQ,i2k;Y) + 2%Q,ik;Y) [trace iMJ:)f^+[bQ,{2;Y)] 

Also, it is instructive to have a sense of the parameters that impact these bounds and how they grow. In the case of normally distributed random variables, $Q_M(Y)$ is a weighted $\chi^2$ with $p$ degrees of freedom, the weights being the eigenvalues of $\Sigma^{1/2}M\Sigma^{1/2}$. In this case, we have $b_{Q_2}(2;Y) = \sup_{M:\,|||M|||_2=1} 2\,\mathrm{trace}((\Sigma M)^2)$. When $|||M|||_2 = 1$, it is easy to see that $\mathrm{trace}((\Sigma M)^2) \le \mathrm{trace}(\Sigma^2)$, since if $A \succeq B$, and both are positive semi-definite, then $\mathrm{trace}(A^2) \ge \mathrm{trace}(B^2)$. Hence, $b_{Q_2} = 2\,\mathrm{trace}(\Sigma^2)$.
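The Gaussian computation above can be illustrated with a small sketch. It evaluates $2\,\mathrm{trace}((\Sigma M)^2)$, the variance of $Y'MY$ for $Y \sim \mathcal{N}(0,\Sigma)$, for two choices of $M$ with operator norm 1. The function names and the example matrices are hypothetical; only the formula comes from the text.

```python
def matmul(p, q):
    """Product of two matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def trace(m):
    """Trace of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


def gaussian_quadratic_variance(sigma, m):
    """2 * trace((Sigma M)^2), the variance of Y'MY when Y ~ N(0, Sigma)."""
    sm = matmul(sigma, m)
    return 2.0 * trace(matmul(sm, sm))


# Illustrative inputs (assumptions, not taken from the paper).
SIGMA = [[2.0, 1.0], [1.0, 3.0]]
M_PROJ = [[1.0, 0.0], [0.0, 0.0]]    # projection on e_1, operator norm 1
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # attains the upper bound 2 trace(Sigma^2)
```

Here `gaussian_quadratic_variance(SIGMA, M_PROJ)` stays below `gaussian_quadratic_variance(SIGMA, IDENTITY)`, which equals $2\,\mathrm{trace}(\Sigma^2)$, in line with the supremum being attained at $M = \mathrm{Id}$.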

At this point, one might be concerned about the fact that these quantities will be dependent on extreme eigenvalues of $\Sigma$. However, in some situations, we can mitigate this problem. For instance, in the case where we assume that the data are i.i.d. with the same covariance $\Sigma$, it will sometimes be possible to work with $Y$ having covariance $\mathrm{Id}$, by simply replacing the shrinkage factor $A$ by $\Sigma^{-1/2}A\Sigma^{-1/2}$, and the vector $x$ at which we evaluate the shrunken matrix by $\Sigma^{-1/2}x$. This is the case for instance when considering

$$x'(\widehat{\Sigma} + A)^{-1}x .$$



3.2.1 Meaningfulness of the assumptions and applicability 

It is of course important to check that the assumptions we make can be applied to a wide variety of 
situations. It is therefore instructive to give examples at this point. Here are two. 

• Suppose that $X_i$ satisfies $P(|X_i'v| > t) \le C\exp(-ct^b)$, and $X_i$ has mean 0. Then

$$b_L(k;X_i) \le \frac{Ck}{b\,c^{k/b}}\,\Gamma\left(\frac{k}{b}\right) .$$

• Suppose that $X_i$ satisfies $P(|X_i'v| > t) \le Ct^{-b}$. Then if $b > (k + 1)$, $b_L(k;X_i)$ can be taken finite, with a value depending only on $C$, $b$ and $k$.

We note that the condition on the $b_L(k;X_i)$'s is rather minimal: all we need is some concentration of linear forms in $X_i$.

The exponential deviation inequality might look like a strong assumption. However, it is satisfied by 
many distributions, with quite non-linear structures which would be difficult to analyze if one did not 
resort to concentration of measure arguments (see Ledoux (2001) for a very thorough reference, and see 
for instance El Karoui (2009a) for spelled-out examples). For the convenience of the reader, here are some 
examples taken from this last reference (justifications can be found there): 

• Gaussian random variables, with $|||\Sigma|||_2$ bounded for instance. (Note that this can be relaxed considerably.)

• Vectors of the type $\sqrt{p}\,r$ where $r$ is uniformly distributed on the unit ($\ell_2$-)sphere in dimension $p$.

• Vectors $X = \Gamma\sqrt{p}\,r$, with $r$ uniformly distributed on the unit ($\ell_2$-)sphere in $\mathbb{R}^p$ and with $\Gamma\Gamma' = \Sigma$ with e.g. $|||\Sigma|||_2$ bounded.

• Vectors of the type $X = p^{1/b}r$, $1 \le b \le 2$, where $r$ is "uniformly" sampled in the $\ell_b$ ball or sphere in $\mathbb{R}^p$. (See Ledoux (2001), Theorem 4.21, which refers to Schechtman and Zinn (2000) as the source of the theorem and explains the details of the sampling.)

• Vectors $X$ with log-concave density of the type $e^{-U(x)}$ with the Hessian of $U$ satisfying, for all $x$, $\mathrm{Hess}(U) \succeq c\,\mathrm{Id}_p$ (see Ledoux (2001), Theorem 2.7). For simplicity, though it may not be needed, one can assume that $|||\Sigma|||_2$ remains bounded.

• Vectors $X$ distributed according to a (centered) Gaussian copula, with corresponding correlation matrix, $\Sigma$, having $|||\Sigma|||_2$ bounded. In other words, if $Z \sim \mathcal{N}(0,\Sigma)$, $X = \Phi(Z) - 1/2$, where $\Phi$ is the cdf of the standard Gaussian distribution.

• Vectors $X = \Sigma^{1/2}Y$, where $Y$ has i.i.d. bounded entries. See Corollary 4.10 in Ledoux (2001) for the concentration part. Here we crucially need the fact that the concentration of measure results we rely on are valid for convex 1-Lipschitz functions (and we do not need them for all Lipschitz functions).

• More "exotic" examples involving vectors sampled uniformly from certain Riemannian submanifolds of $\mathbb{R}^p$. We refer to Ledoux (2001), Theorems 2.4 and 3.1, for the concentration aspects of these questions.

Bounding of $b_{Q_2}$ can either be done directly or using the connection (and bound) between $b_{Q_2}$ and $b_{Q_1}$ we just made explicit. If $X_i$ satisfies a concentration inequality for convex Lipschitz functions, then bounding $b_{Q_1}$ is rather simple and this gives us a bound on $b_{Q_2}$. We now work out the details of this problem. The analysis is standard and follows along the lines of work done in e.g. Ledoux (2001), Chapter 1.



An important example: case of concentrated random variables. As a matter of fact, suppose that $X_i$ is such that for any convex and 1-Lipschitz function $f$, if $X = X_i$,

$$P(|f(X) - E(f(X))| > t) \le C\exp(-ct^b) \quad\text{or}\quad P(|f(X) - \mathrm{median}(f(X))| > t) \le C\exp(-ct^b) .$$

Since $f_v(X) = X'v$ is trivially convex and $\|v\|$-Lipschitz, we see that if the concentration inequality is around the mean, we immediately have

$$b_L(k;X_i) \le \frac{Ck}{b\,c^{k/b}}\,\Gamma\left(\frac{k}{b}\right) .$$

If we "only" have a concentration bound around the median, then we can simply use

$$E\left(|X'v|^k\right) \le 2^{k-1}\left[E\left(|X'v - \mathrm{median}(X'v)|^k\right) + |\mathrm{median}(X'v)|^k\right] .$$

The concentration inequality gives us control of the first term, while $|\mathrm{median}(X'v)| = |\mathrm{median}(X'v) - E(X'v)|$, which is also controlled (see Proposition 1.9 in Ledoux (2001)) or simply

$$|\mathrm{median}(X'v) - E(X'v)| \le E\left(|X'v - \mathrm{median}(X'v)|\right) = \int_0^\infty P(|X'v - \mathrm{median}(X'v)| > t)\,dt \le C\int_0^\infty \exp(-ct^b)\,dt .$$

This is of course nothing else than $C\,\Gamma(1/b)/(b\,c^{1/b})$, and so we have a uniform bound.
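The Gamma-function bounds above come from integrating the tail: $E|W|^k = \int_0^\infty k t^{k-1}P(|W| > t)\,dt \le Ck\int_0^\infty t^{k-1}\exp(-ct^b)\,dt$. The following sketch, with hypothetical function names, compares the closed form with a plain midpoint-rule approximation of that integral; the truncation point and step count are arbitrary choices.

```python
import math


def tail_moment_bound(k, b, c, big_c):
    """Closed form of big_c * k * integral_0^inf t^(k-1) exp(-c t^b) dt."""
    return big_c * k * math.gamma(k / b) / (b * c ** (k / b))


def tail_moment_integral(k, b, c, big_c, upper=40.0, steps=100_000):
    """Midpoint-rule approximation of the same integral, truncated at `upper`."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t ** (k - 1) * math.exp(-c * t ** b)
    return big_c * k * total * h
```

For instance, a standard Gaussian variable satisfies $P(|W| > t) \le 2\exp(-t^2/2)$, so with `k=4, b=2, c=0.5, big_c=2.0` the bound evaluates to 16, which is indeed above the true fourth moment 3.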

Similarly, when $M$ is a positive definite matrix with $|||M|||_2 \le 1$, $\sqrt{X_i'MX_i}$ is a convex 1-Lipschitz function (with respect to Euclidean norm for $X_i$). Using the fact that for a non-negative random variable $Z$, $E(Z^k) = \int_0^\infty kx^{k-1}P(Z > x)\,dx$, we see that, if our concentration result is around the mean,

$$E\left(\left|\sqrt{X_i'MX_i} - E\left(\sqrt{X_i'MX_i}\right)\right|^k\right) \le \frac{Ck}{b\,c^{k/b}}\,\Gamma\left(\frac{k}{b}\right) ,$$

and the right-hand side can serve as $b_{Q_1}(k;X_i)$.

Hence, when the $X_i$ satisfy a dimension-free concentration inequality, $b_{Q_1}(k;X_i)$ remains bounded uniformly in $p$ and $n$. Therefore, when $\mathrm{trace}(\Sigma)/n$ remains bounded as $n$ grows, so does $b_{Q_2}(k;X_i)/n^{k/2}$, thanks to the relationship between $b_{Q_1}$ and $b_{Q_2}$ we have highlighted above.

The conclusion of this short discussion is that random variables satisfying a dimension-free concentration inequality and having covariance such that $\{\mathrm{trace}(\Sigma_i)/n\}_{i=1}^n$ remains uniformly bounded in $n$ and $p$ will have $b_{Q_2}(2;X_i)/n$ and $b_L(4;X_i)$ uniformly bounded (in $n$). Because we will later express our various bounds in terms of these quantities, this observation is very important from the point of view of the applicability of our results.

An important distribution in practice (in particular in financial applications) is the log-normal distribution. Getting bounds for $b_L$ and $b_{Q_2}$ here requires work which we now perform.

3.2.2 The case of the log-normal distribution

Let $Z = (Z_1, \ldots, Z_p)$ be a random vector with a normal distribution with parameters $\mu = (\mu_i)$ and $\Sigma = (\Sigma_{i,j})$. Then the random vector $Y := (Y_1, \ldots, Y_p)$ with $Y_i := \exp(Z_i)$, $i = 1, \ldots, p$, is said to have a log-normal distribution with parameters $\mu$ and $\Sigma$ (see e.g. Mardia, Kent and Bibby (1979), Chapter 2.6). Note that the moments of the log-normal distribution are all finite, and can be obtained from the moment generating function of the normal distribution. Indeed, for any $t = (t_1, \ldots, t_p) \in \mathbb{N}_0^p$, we have

$$E(Y_1^{t_1} \cdots Y_p^{t_p}) = E(\exp(t'Z)) = \exp\left(t'\mu + \tfrac{1}{2}t'\Sigma t\right) . \quad (5)$$

Set $\mu_* := \|\mu\|_2$ and $\sigma_*^2 := |||\Sigma|||_2$. Then, for any $t = (t_1, \ldots, t_p) \in \mathbb{N}_0^p$, we have the estimate

$$E(Y_1^{t_1} \cdots Y_p^{t_p}) \le \exp\left(\|t\|_2\,\mu_* + \tfrac{1}{2}\|t\|_2^2\,\sigma_*^2\right) . \quad (6)$$

Put $X := Y - E(Y)$ (where the expectation is taken componentwise, of course). In this section we will derive bounds for the constants $b_L(2r, X)$ and $b_{Q_2}(2, X)$ associated with the (centered) log-normal distribution.
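To make the moment formula (5) and the estimate (6) concrete, here is a small sketch for $p = 2$. The function names and the numerical values of $\mu$ and $\Sigma$ are illustrative assumptions; the operator norm is computed with the closed-form largest eigenvalue of a symmetric 2 x 2 matrix.

```python
import math


def lognormal_moment(t, mu, sigma):
    """Exact mixed moment E(Y_1^t_1 ... Y_p^t_p) = exp(t'mu + t'Sigma t / 2)."""
    p = len(t)
    lin = sum(t[i] * mu[i] for i in range(p))
    quad = sum(t[i] * sigma[i][j] * t[j] for i in range(p) for j in range(p))
    return math.exp(lin + 0.5 * quad)


def operator_norm_2x2(sigma):
    """Largest eigenvalue of a symmetric psd 2 x 2 matrix (closed form)."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    return 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)


def lognormal_moment_bound(t, mu, sigma):
    """Right-hand side of estimate (6), written for p = 2."""
    t_norm = math.sqrt(sum(ti * ti for ti in t))
    mu_star = math.sqrt(sum(m * m for m in mu))
    sigma_star_sq = operator_norm_2x2(sigma)
    return math.exp(t_norm * mu_star + 0.5 * t_norm ** 2 * sigma_star_sq)


# Illustrative parameters (assumptions, not taken from the paper).
MU = [0.1, -0.2]
SIGMA = [[0.5, 0.2], [0.2, 0.3]]
```

With these values, `lognormal_moment([2, 1], MU, SIGMA)` equals $\exp(1.55)$ and is dominated by `lognormal_moment_bound([2, 1], MU, SIGMA)`.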

In the sequel we always assume that $Z = \mu + \Sigma^{1/2}\widetilde{Z}$, where $\widetilde{Z}$ is a $p$-dimensional Gaussian random vector with zero mean and identity covariance. Our derivation will be based on the following result for the Gaussian distribution (Pisier, 1986, Chapter 2): If $F$ is a continuously differentiable function and $\nabla F$ is the gradient of $F$ (which we always regard as a column vector), then, for any $r \ge 1$,

$$E|F(\widetilde{Z}) - E(F(\widetilde{Z}))|^r \le \kappa_r\left(\frac{\pi}{2}\right)^r E\|\nabla F(\widetilde{Z})\|_2^r ,$$

where $\kappa_r$ is the $r$th (absolute) moment of the standard Gaussian distribution.

For any $z = (z_j) \in \mathbb{R}^p$, let $\exp(z) := (\exp(z_j)) \in \mathbb{R}^p$ (by slight abuse of notation), and note that this vector-valued version of the exponential function is continuously differentiable and its Jacobian matrix $D(z)$ is diagonal with the elements $\exp(z_j)$ on the main diagonal. With this notation, $Y = \exp(Z) = \exp(\mu + \Sigma^{1/2}\widetilde{Z})$, and we get, for any $r \ge 1$,

$$E|F(Y) - E(F(Y))|^r \le \kappa_r\left(\frac{\pi}{2}\right)^r E\|\nabla F(Y)'D(Z)\Sigma^{1/2}\|_2^r .$$

We now specialize this result to linear and quadratic forms.

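Linear Forms. Consider the linear form $F(y) := v'y$, where $v = (v_i)$ is a deterministic vector with Euclidean norm 1. Then $\nabla F(y) = v$, and we get, for any integer $r \ge 1$,

$$E|F(Y) - E(F(Y))|^{2r} \le \kappa_{2r}\left(\frac{\pi}{2}\right)^{2r}|||\Sigma|||_2^r\, E\left(v'D(Z)D(Z)v\right)^r .$$

Now, using the special structure of the diagonal matrix $D(Z)$ and the bound (6), we find that

$$\begin{aligned}
E\left(v'D(Z)D(Z)v\right)^r &= \sum_{i_1}\cdots\sum_{i_r} v_{i_1}^2\cdots v_{i_r}^2\, E\left(Y_{i_1}^2\cdots Y_{i_r}^2\right) \\
&\le \exp\left(2r\mu_* + \tfrac{1}{2}(2r)^2\sigma_*^2\right)\left(\sum_i v_i^2\right)^r = \exp\left(2r\mu_* + \tfrac{1}{2}(2r)^2\sigma_*^2\right) .
\end{aligned}$$

Combining these estimates, we conclude that

$$E|F(Y) - E(F(Y))|^{2r} \le \kappa_{2r}\left(\frac{\pi}{2}\right)^{2r}\sigma_*^{2r}\exp\left(2r\mu_* + \tfrac{1}{2}(2r\sigma_*)^2\right) .$$

Since $v'X - E(v'X) = v'Y - E(v'Y)$, it follows that

$$b_L(2r,X) \le \kappa_{2r}\left(\frac{\pi}{2}\right)^{2r}\sigma_*^{2r}\exp\left(2r\mu_* + \tfrac{1}{2}(2r\sigma_*)^2\right) .$$

In particular, if $\mu_*$ and $\sigma_*$ are uniformly bounded, this is of the order $O(1)$.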

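Quadratic Forms. Consider the quadratic form $F(y) := y'My$, where $M$ is a deterministic symmetric matrix with operator norm 1. Then $\nabla F(y) = 2My$, and we get, for any integer $r \ge 1$,

$$E|F(Y) - E(F(Y))|^{2r} \le \kappa_{2r}\,\pi^{2r}|||\Sigma|||_2^r\, E\left(Y'MD(Z)D(Z)MY\right)^r .$$

Observing that $Y = D(Z)\mathbf{1}$, where $\mathbf{1}$ is the vector consisting of 1's, and setting $N := D(Z)MD(Z)$, it follows that

$$E|F(Y) - E(F(Y))|^{2r} \le \kappa_{2r}\,\pi^{2r}|||\Sigma|||_2^r\, E\left(\mathbf{1}'N^2\mathbf{1}\right)^r .$$

Because most of our bounds depend on $b_{Q_2}(2;X_i)$ only, let us now consider the case $r = 1$. Note that

$$(N^2)_{k,l} = \sum_j M_{k,j}M_{j,l}\exp(Z_k + 2Z_j + Z_l) .$$

So

$$E\left((N^2)_{k,l}\right) = \sum_j M_{k,j}M_{j,l}\,E\left(\exp(Z_k + 2Z_j + Z_l)\right) .$$

Now $Z_k + 2Z_j + Z_l = (e_k + 2e_j + e_l)'Z$, so, by (5),

$$\begin{aligned}
E\left(\exp(Z_k + 2Z_j + Z_l)\right) &= \exp\left((2e_j + e_k + e_l)'\mu\right)\exp\left(\tfrac{1}{2}(2e_j + e_k + e_l)'\Sigma(2e_j + e_k + e_l)\right) \\
&= \exp(2\mu_j + \mu_k + \mu_l)\exp\left(2\Sigma_{j,j} + \Sigma_{k,k}/2 + \Sigma_{l,l}/2 + 2\Sigma_{j,k} + 2\Sigma_{j,l} + \Sigma_{k,l}\right) .
\end{aligned}$$

Therefore,

$$E\left((N^2)_{k,l}\right) = e^{-\Sigma_{k,k}/2}\,e^{\Sigma_{k,l}}\,e^{-\Sigma_{l,l}/2}\sum_j \left(M_{k,j}\exp(\mu_j + \mu_k + \Sigma_{j,j} + \Sigma_{k,k} + 2\Sigma_{j,k})\right)\left(M_{j,l}\exp(\mu_j + \mu_l + \Sigma_{j,j} + \Sigma_{l,l} + 2\Sigma_{j,l})\right) .$$

Let us now write $A \circ B$ for the Hadamard product of two matrices $A$ and $B$ and $e^{\circ A}$ for the Hadamard exponential of a matrix $A$, i.e. the matrix with entries $e^{A_{i,j}}$. Let us call $\Lambda$ and $\widetilde{\Lambda}$ the diagonal matrices with entries $e^{\Sigma_{j,j}}$ and $e^{\mu_j + \Sigma_{j,j}}$, respectively. Note that $M_{k,j}\exp(\mu_j + \mu_k + \Sigma_{j,j} + \Sigma_{k,k} + 2\Sigma_{j,k})$ is the $(k,j)$ entry of the matrix $\widetilde{\Lambda}(M \circ e^{\circ 2\Sigma})\widetilde{\Lambda}$. So

$$E\left(N^2\right) = \left[\Lambda^{-1/2}e^{\circ\Sigma}\Lambda^{-1/2}\right] \circ \left(\widetilde{\Lambda}(M \circ e^{\circ 2\Sigma})\widetilde{\Lambda}\right)^2 .$$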
Now recall that for any vector $x$, if $D_x$ is the diagonal matrix with $x$ on its diagonal (see Horn and Johnson (1994), Lemma 5.1.5),

$$x'(A \circ B)x = \mathrm{trace}\left(D_xAD_xB'\right) .$$

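Hence,

$$\mathbf{1}'E\left(N^2\right)\mathbf{1} = \mathrm{trace}\left(\mathrm{Id}_p\left[\Lambda^{-1/2}e^{\circ\Sigma}\Lambda^{-1/2}\right]\mathrm{Id}_p\left(\widetilde{\Lambda}(M \circ e^{\circ 2\Sigma})\widetilde{\Lambda}\right)^2\right) = \mathrm{trace}\left(\left[\Lambda^{-1/2}e^{\circ\Sigma}\Lambda^{-1/2}\right]\left(\widetilde{\Lambda}(M \circ e^{\circ 2\Sigma})\widetilde{\Lambda}\right)^2\right) .$$

Now the Hadamard exponential of a psd matrix is psd (see Horn and Johnson (1994), p. 450). Recall also that for $A$ and $B$ psd matrices, $A \circ B$ is psd (Horn and Johnson (1994), p. 309) and

$$|||A \circ B|||_2 = \lambda_{\max}(A \circ B) \le \max_i a_{i,i}\,\lambda_{\max}(B) ,$$

by Theorem 5.3.4 in Horn and Johnson (1994). Therefore, since $M$ is psd and $|||M|||_2 \le 1$,

$$|||M \circ e^{\circ 2\Sigma}|||_2 \le \exp\left(2\max_j \Sigma_{j,j}\right) .$$

So

$$|||\widetilde{\Lambda}(M \circ e^{\circ 2\Sigma})\widetilde{\Lambda}|||_2 \le \exp\left(2\max_j \mu_j + 4\max_j \Sigma_{j,j}\right) .$$

So we have, using the fact that when $A$ and $B$ are psd, $\mathrm{trace}(AB) \le \lambda_{\max}(B)\,\mathrm{trace}(A)$, because $A^{1/2}BA^{1/2} \preceq \lambda_{\max}(B)A$,

$$\mathrm{trace}\left(\left[\Lambda^{-1/2}e^{\circ\Sigma}\Lambda^{-1/2}\right]\left(\widetilde{\Lambda}(M \circ e^{\circ 2\Sigma})\widetilde{\Lambda}\right)^2\right) \le p\exp\left(4\max_j \mu_j + 8\max_j \Sigma_{j,j}\right) .$$

Combining the preceding estimates, we conclude that

$$E|F(Y) - E(F(Y))|^2 \le \kappa_2\,\pi^2\sigma_*^2\,p\exp\left(4\mu_* + 8\sigma_*^2\right) .$$

Now set $v := 2ME(Y)$ and note that $\|v\|_2^2 \le 4E\|Y\|_2^2 \le 4p\exp(2\mu_* + 2\sigma_*^2)$. Since $X'MX - E(X'MX) = (Y'MY - E(Y'MY)) - (v'Y - E(v'Y))$, it follows that

$$b_{Q_2}(2, X) \le \kappa_2\,4\pi^2\sigma_*^2\,p\exp\left(4\mu_* + 8\sigma_*^2\right) .$$

In particular, if $\mu_*$ and $\sigma_*$ are uniformly bounded, this is of the order $O(p)$.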
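The Hadamard-product identity $x'(A \circ B)x = \mathrm{trace}(D_xAD_xB')$ used in the derivation above is easy to check numerically. The sketch below does so with plain nested lists; the function names and the test matrices are hypothetical, and the matrices are deliberately non-symmetric so that the transpose on $B$ matters.

```python
def matmul(p, q):
    """Product of two matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def transpose(m):
    """Transpose of a matrix stored as nested lists."""
    return [list(row) for row in zip(*m)]


def hadamard_quadratic_form(x, a, b):
    """x'(A o B)x, where o is the entrywise (Hadamard) product."""
    n = len(x)
    return sum(x[i] * a[i][j] * b[i][j] * x[j]
               for i in range(n) for j in range(n))


def trace_form(x, a, b):
    """trace(D_x A D_x B'), with D_x the diagonal matrix built from x."""
    n = len(x)
    d_x = [[x[i] if i == j else 0 for j in range(n)] for i in range(n)]
    prod = matmul(matmul(matmul(d_x, a), d_x), transpose(b))
    return sum(prod[i][i] for i in range(n))
```

Taking `x` to be the all-ones vector recovers the special case used in the text, $\mathbf{1}'(A \circ B)\mathbf{1} = \mathrm{trace}(AB')$.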
3.3 On quadratic forms involving $(X'X/n + A)^{-1}$
3.3.1 On forms of the type $x'(X'D^2X/n + A)^{-1}x$

Throughout the proofs, we will make heavy use of the following notation: call, consistently with the notations used above,

$$\mathcal{S} = \frac{1}{n}\sum_{i=1}^n R_i^2X_iX_i' = X'D^2X/n ,$$

where $D$ is a diagonal matrix with positive entries containing the $R_i$'s (on its $d_{i,i}$ entry) and $X$ is the $n \times p$ matrix whose $i$-th row is $X_i'$. We will use the notations

$$M = \mathcal{S} + A , \quad A \succeq t\,\mathrm{Id}_p , \qquad f(X) = x'M^{-1}x .$$

To alleviate the notation, we do not show explicitly the dependence of $M$ on $A$ (and therefore, implicitly on $t$). However, our bounds will involve them, to allow us to show the impact of having a small $t$ (a small regularization), and also to show clearly how $x'A^{-1}x$ affects our bounds. Similarly, because we are mostly interested in the impact of the randomness in the $X_i$'s on the form $f(X)$, we keep track only of this random variable.
• Concentration aspects 

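Theorem 3.1. Suppose $X_1, \ldots, X_n \in \mathbb{R}^p$ are independent. Suppose further that $E(X_i) = 0$ and, if $v$ is such that $\|v\| = 1$, $E(|X_i'v|^k) \le b_L(k;X_i)$, where $b_L(k;X_i)$ is a deterministic function depending only on the distribution of $X_i$ and $k$. Call

$$\mathcal{S} = \frac{1}{n}\sum_{i=1}^n R_i^2X_iX_i' ,$$

where the $R_i$ are deterministic.

Call $M = \mathcal{S} + A$, and assume that $A$ is positive definite, with $A \succeq t\,\mathrm{Id}_p$. We also call $f(X) = x'M^{-1}x$. Then, if $\|x\| = 1$,

$$E\left(|f(X) - E(f(X))|^k\right) \le c_k\left\{\left[\sum_{i=1}^n\left(\frac{R_i^4}{n^2t^4}\,b_L(4;X_i) \wedge t^{-2}\right)\right]^{k/2} + \sum_{i=1}^n\left(\frac{R_i^{2k}}{n^kt^{2k}}\,b_L(2k;X_i) \wedge t^{-k}\right)\right\} ,$$

with $c_k$ the constant of Equation (2).

We note that the bound given in the proof below shows the actual dependence of this upper bound on $x'A^{-1}x$. Also, it would be easy to handle the situation where the $R_i$'s are random but independent of the $X_i$'s.

Proof. We naturally apply Lemma 3.1 to tackle this problem. Let us call $M_i = M - \frac{1}{n}R_i^2X_iX_i'$. Using the classic rank-1 update formula,

$$M^{-1} = M_i^{-1} - \frac{R_i^2}{n}\,\frac{M_i^{-1}X_iX_i'M_i^{-1}}{1 + R_i^2X_i'M_i^{-1}X_i/n} .$$

Therefore, if $Z = x'M^{-1}x$ and $Z_i = x'M_i^{-1}x$,

$$Z - Z_i = -\frac{R_i^2}{n}\,\frac{(x'M_i^{-1}X_i)^2}{1 + R_i^2X_i'M_i^{-1}X_i/n} .$$

Hence,

$$|Z - Z_i| \le \frac{R_i^2}{n}(x'M_i^{-1}X_i)^2 \wedge (x'M_i^{-1}x) ,$$

because $M_i$ is positive definite and $(x'M_i^{-1}X_i)^2 \le (x'M_i^{-1}x)(X_i'M_i^{-1}X_i)$ by the Cauchy-Schwarz inequality. Let us call $E_i(\cdot)$ expectation with respect to $X_i$ only. Clearly, using our assumption on $X_i$, we have

$$E_i\left(|X_i'M_i^{-1}x|^k\right) \le \|M_i^{-1}x\|^k\,b_L(k;X_i) .$$

Hence,

$$E_i\left(|Z - Z_i|^k\right) \le \left(\frac{R_i^2}{n}\right)^k(x'M_i^{-2}x)^k\,b_L(2k;X_i) \wedge (x'M_i^{-1}x)^k .$$

Now, $M_i \succeq A \succeq t\,\mathrm{Id}_p$, so $x'M_i^{-2}x \le t^{-1}x'A^{-1}x$ and $x'M_i^{-1}x \le x'A^{-1}x$, using the fact that $B \mapsto -B^{-1}$ is operator monotone on Hermitian matrices (Bhatia (1997), p. 114). So we finally have the bounds

$$E\left(|Z - Z_i|^2 \,|\, F_{i-1}\right) \le \left(\frac{R_i^2}{n}\right)^2t^{-2}(x'A^{-1}x)^2\,b_L(4;X_i) \wedge (x'A^{-1}x)^2 ,$$

$$E\left(|Z - Z_i|^k\right) \le \left(\frac{R_i^2}{n}\right)^kt^{-k}(x'A^{-1}x)^k\,b_L(2k;X_i) \wedge (x'A^{-1}x)^k .$$

Now recalling Equation (2), we have

$$E\left(|Z - E(Z)|^k\right) \le c_k\left\{\left[\sum_{i=1}^n\left(\left(\frac{R_i^2}{n}\right)^2\frac{(x'A^{-1}x)^2}{t^2}\,b_L(4;X_i) \wedge (x'A^{-1}x)^2\right)\right]^{k/2} + \sum_{i=1}^n\left(\left(\frac{R_i^2}{n}\right)^k\frac{(x'A^{-1}x)^k}{t^k}\,b_L(2k;X_i) \wedge (x'A^{-1}x)^k\right)\right\} .$$

Using the fact that $A \succeq t\,\mathrm{Id}_p$ and $\|x\| = 1$, we have $x'A^{-1}x \le t^{-1}$, and this gives the result announced in the theorem. □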
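The rank-1 update step at the heart of the proof above can be checked numerically. The sketch below works with 2 x 2 matrices: it computes $Z = x'M^{-1}x$ directly and through $Z_i - \frac{R_i^2}{n}\frac{(x'M_i^{-1}X_i)^2}{1 + R_i^2X_i'M_i^{-1}X_i/n}$. All names and numerical inputs are hypothetical and serve only as an illustration.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def bilinear(x, m, y):
    """Bilinear form x'My for 2-dimensional vectors."""
    return sum(x[i] * m[i][j] * y[j] for i in range(2) for j in range(2))


def direct_and_updated(x, m_i, x_i, r, n):
    """Return (Z, Z via the rank-1 update, Z_i) for M = M_i + (r^2/n) x_i x_i'."""
    w = r * r / n
    m = [[m_i[i][j] + w * x_i[i] * x_i[j] for j in range(2)] for i in range(2)]
    direct = bilinear(x, inv2(m), x)
    m_i_inv = inv2(m_i)
    z_i = bilinear(x, m_i_inv, x)          # Z_i = x'M_i^{-1}x
    cross = bilinear(x, m_i_inv, x_i)      # x'M_i^{-1}X_i
    q = bilinear(x_i, m_i_inv, x_i)        # X_i'M_i^{-1}X_i
    updated = z_i - w * cross ** 2 / (1.0 + w * q)
    return direct, updated, z_i
```

The two evaluations of $Z$ agree up to rounding, and $0 \le Z_i - Z \le Z_i$, which is the elementary bound $|Z - Z_i| \le x'M_i^{-1}x$ used in the proof.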

• Lindeberg approach and why the limit does not depend on the distribution of $X_i$ We are now interested in showing that for a broad class of distributions for $X_i$, the limit of

$$x'(X'D^2X/n + A)^{-1}x$$

or more precisely

$$E\left(x'(X'D^2X/n + A)^{-1}x\right)$$

does not depend on the distribution of $X_i$. We have already seen that we can control the fluctuations of $x'(X'D^2X/n + A)^{-1}x$ around its mean for a broad class of distributions, so all we need to show is that they all have the same means.

We have the following theorem. 

Theorem 3.2. Suppose $X_i$ are i.i.d. and $Y_i$ are i.i.d. and follow the assumptions mentioned above (at the beginning of Subsection 3.2). Assume that $D$ is a deterministic diagonal matrix, whose diagonal entries are positive and denoted by $R_j$. We assume that $A$ is a positive definite matrix with $A \succeq t\,\mathrm{Id}_p$, for some $t > 0$.

Then, for any given vector $x$, if $f(X) = x'(X'D^2X/n + A)^{-1}x$,

$$|E(f(X) - f(Y))| \le \sum_{j=1}^n U_j(X_j) + U_j(Y_j) ,$$

where

$$U_j(Y_j) = \frac{R_j^4}{n^{3/2}}\,\frac{x'A^{-1}x}{t^2}\sqrt{b_L(4;Y_j)}\sqrt{b_{Q_2}(2;Y_j)/n} \;\wedge\; \frac{R_j^2}{n}\,\frac{x'A^{-1}x}{t}\,b_L(2;Y_j) .$$

Let us discuss briefly this result. We see that assuming $\max_j |||\Sigma_j|||_2$ is bounded, and making assumptions on $b_L$ and $b_{Q_2}$ that match the Gaussian situation (i.e. $b_L$ and $b_{Q_2}/n$ uniformly bounded in $n$), the upper bound on the error is of the form (up to constants)

$$\sum_{i=1}^n \frac{R_i^4}{n^{3/2}} \wedge \frac{R_i^2}{n} .$$

If the $R_i$'s are given by square-integrable i.i.d. random variables (the same for each $n$), we have

$$E\left(\frac{R_i^4}{n^{3/2}} \wedge \frac{R_i^2}{n}\right) = o(n^{-1}) .$$

Hence, when this is the case, and the assumptions of our discussion are met, we have

$$E(f(X) - f(Y)) \to 0 ,$$

where $E(\cdot)$ is here expectation with respect to all sources of randomness (i.e. $R_i$'s, $X_i$'s and $Y_i$'s). Simple computations also show that if the $R_i$'s are random and have $2 + \epsilon$ moments, with $\epsilon < 2$,

$$E\left(\frac{R_i^4}{n^{3/2}} \wedge \frac{R_i^2}{n}\right) \le \frac{K}{n^{1+\epsilon/4}} .$$

Hence, when this is the case, we have

$$E(f(X) - f(Y)) \to 0$$

provided that $b_L$ and $b_{Q_2}$ do not grow too fast to infinity. If we are in a situation where $Y_j = \Sigma_j^{1/2}Y_0$,







where Yq is such that hL{k\YQ) = 0(1) and hq^{k;YQ) = 0(1), the theorem can handle the case where 
lll^jllb ^ n''/^ (which allows |||Sj|||2 go to infinity). Note that because we are interested in covariance 
matrices, we will always require Ri to have at least 2 moments and so this theorem essentially covers all 
the cases of interests to us. 

The meaning of the theorem is therefore that under these assumptions, i.e. when the upper bound goes to 0 for the $Y_j$ and, say, the $X_j$ are Gaussian, all we have to do is understand $E(f(X))$ when $X$ is Gaussian. For this task, we can use many of the nice and well-known properties of the Gaussian distribution (which include strong concentration properties).

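Proof. It is clear that $E(f(X))$ exists since $A$ is positive definite. We employ the Lindeberg approach (Lindeberg (1922), and e.g. Stroock (1993)) to show that the limit does not depend on the distribution of $X_i$ (note that this technique has been used in other random matrix theoretic questions, e.g. Chatterjee (2005), though the results of this paper do not seem directly applicable; note also that here all our expansions are exact whereas often in the Lindeberg method Taylor approximation arguments are used. That is why we choose to present such an approach). Let us call

$$Z_j = (Y_1, Y_2, \ldots, Y_{j-1}, X_j, \ldots, X_n) ,$$

with the convention that $Z_1 = (X_1, \ldots, X_n)$ and $Z_{n+1} = (Y_1, \ldots, Y_n)$. Clearly,

$$E(f(X) - f(Y)) = \sum_{j=1}^n E\left(f(Z_j) - f(Z_{j+1})\right) .$$

Now let us call $M_j = A + Z_j'D^2Z_j/n - R_j^2X_jX_j'/n$. Note that

$$f(Z_j) = x'(M_j + R_j^2X_jX_j'/n)^{-1}x , \qquad f(Z_{j+1}) = x'(M_j + R_j^2Y_jY_j'/n)^{-1}x ,$$

and $M_j$ is independent of both $X_j$ and $Y_j$. Therefore, using the fact that $(M + uu')^{-1} = M^{-1} - M^{-1}uu'M^{-1}/(1 + u'M^{-1}u)$ (see Horn and Johnson (1990), Chapter 0), we have

$$f(Z_j) - f(Z_{j+1}) = \frac{R_j^2}{n}\left[\frac{(x'M_j^{-1}Y_j)^2}{1 + \frac{R_j^2}{n}Y_j'M_j^{-1}Y_j} - \frac{(x'M_j^{-1}X_j)^2}{1 + \frac{R_j^2}{n}X_j'M_j^{-1}X_j}\right] .$$

Since $Y_j$ and $X_j$ have the same covariance matrix, $\Sigma_j$, if we call $d_j = \mathrm{trace}(M_j^{-1}\Sigma_j)$, and $q_j(Y_j) = Y_j'M_j^{-1}Y_j$, we see that

$$\frac{1}{1 + R_j^2q_j(Y_j)/n} = \frac{1}{1 + R_j^2d_j/n} + \frac{R_j^2}{n}\,\delta_j(Y_j) , \quad (8)$$

where

$$\delta_j(Y_j) := \frac{d_j - q_j(Y_j)}{\left(1 + R_j^2q_j(Y_j)/n\right)\left(1 + R_j^2d_j/n\right)} .$$

Hence, we see that

$$\frac{(x'M_j^{-1}Y_j)^2}{1 + R_j^2q_j(Y_j)/n} = \frac{(x'M_j^{-1}Y_j)^2}{1 + R_j^2d_j/n} + \frac{R_j^2}{n}(x'M_j^{-1}Y_j)^2\,\delta_j(Y_j) . \quad (9)$$

Therefore,

$$f(Z_j) - f(Z_{j+1}) = \frac{R_j^2}{n}\left[\frac{(x'M_j^{-1}Y_j)^2}{1 + \frac{R_j^2}{n}d_j} - \frac{(x'M_j^{-1}X_j)^2}{1 + \frac{R_j^2}{n}d_j}\right] + \frac{R_j^4}{n^2}\left[(x'M_j^{-1}Y_j)^2\delta_j(Y_j) - (x'M_j^{-1}X_j)^2\delta_j(X_j)\right] .$$

Interestingly, the first term in the above expansion, which we call $T_j(1)$, has mean 0, since our assumption of independence (on the $X_j$'s and $Y_j$'s) guarantees that $M_j$ is independent of both $Y_j$ and $X_j$, and $X_j$ and $Y_j$ have the same covariance. So, calling $T_j(2)$ the second term, we have shown that

$$E(f(X) - f(Y)) = \sum_{j=1}^n E\left(f(Z_j) - f(Z_{j+1}) - T_j(1)\right) = \sum_{j=1}^n E\left(T_j(2)\right) .$$

On the one hand, using the Cauchy-Schwarz inequality, we get

$$E_j\left((Y_j'M_j^{-1}x)^2|\delta_j(Y_j)|\right) \le \sqrt{E_j\left((Y_j'M_j^{-1}x)^4\right)}\sqrt{E_j\left(\delta_j(Y_j)^2\right)} .$$

By our assumptions (3) and (4), we have

$$E_j\left((Y_j'M_j^{-1}x)^4\right) \le (x'M_j^{-2}x)^2\,b_L(4;Y_j) \le \left(\frac{x'A^{-1}x}{t}\right)^2b_L(4;Y_j)$$

and

$$E_j\left(\delta_j(Y_j)^2\right) \le b_{Q_2}(2;Y_j)\,|||M_j^{-1}|||_2^2 \le b_{Q_2}(2;Y_j)\,\frac{1}{t^2} , \quad (10)$$

since $M_j^{-1} \preceq A^{-1} \preceq t^{-1}\mathrm{Id}$. Putting everything together, and taking expectations over the other variables, we finally obtain

$$E\left((Y_j'M_j^{-1}x)^2|\delta_j(Y_j)|\right) \le \frac{x'A^{-1}x}{t^2}\sqrt{b_L(4;Y_j)}\sqrt{b_{Q_2}(2;Y_j)} . \quad (11)$$

On the other hand, by construction, we have

$$\frac{R_j^2}{n}\,\delta_j(Y_j) = \frac{1}{1 + R_j^2q_j(Y_j)/n} - \frac{1}{1 + R_j^2d_j/n} ,$$

and both fractions lie in $[0,1]$ because both $d_j$ and $q_j(Y_j)$ are non-negative. Thus, we see that

$$\frac{R_j^2}{n}\,|\delta_j(Y_j)| \le 1 , \quad (12)$$

and therefore,

$$\frac{R_j^2}{n}\,E\left((Y_j'M_j^{-1}x)^2|\delta_j(Y_j)|\right) \le b_L(2;Y_j)\,\frac{x'A^{-1}x}{t} . \quad (13)$$

Naturally, the same bounds hold for $E\left((X_j'M_j^{-1}x)^2|\delta_j(X_j)|\right)$. We conclude that

$$\begin{aligned}
|E(f(X) - f(Y))| \le \sum_{j=1}^n &\left[\frac{R_j^4}{n^{3/2}}\,\frac{x'A^{-1}x}{t^2}\sqrt{b_L(4;Y_j)}\sqrt{b_{Q_2}(2;Y_j)/n} \wedge \frac{R_j^2}{n}\,\frac{x'A^{-1}x}{t}\,b_L(2;Y_j)\right] \\
+ \sum_{j=1}^n &\left[\frac{R_j^4}{n^{3/2}}\,\frac{x'A^{-1}x}{t^2}\sqrt{b_L(4;X_j)}\sqrt{b_{Q_2}(2;X_j)/n} \wedge \frac{R_j^2}{n}\,\frac{x'A^{-1}x}{t}\,b_L(2;X_j)\right] ,
\end{aligned}$$

as announced in the theorem. □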



3.3.2 On quadratic forms involving $DX(X'D^2X/n + A)^{-1}X'D$

We are now interested in quadratic forms of the type

$$\alpha'\frac{DX}{\sqrt{n}}(X'D^2X/n + A)^{-1}\frac{X'D}{\sqrt{n}}\alpha ,$$

which are very useful when working with both sample means and sample covariance matrices. $\alpha$ here will be a vector with norm bounded away from zero and from infinity in most cases. Hence, we will focus without loss of generality on the case $\|\alpha\| = 1$.

Our strategy is once again to use the Lindeberg method in connection with Efron-Stein type variance bounds.

Before we turn to the technical aspects of the questions, let us make our motivation a bit more explicit. Let us call, if $\widetilde{X}_i = \mu + R_iX_i$, $D$ is a diagonal matrix containing the $R_i$'s, and $\mathbf{1}$ is an $n$-dimensional vector having 1 in all its entries,

$$\widehat{\Sigma} = \frac{1}{n}\widetilde{X}'\widetilde{X} - \hat{\mu}_{\widetilde{X}}\hat{\mu}_{\widetilde{X}}' = \frac{1}{n}X'D^2X - \frac{1}{n^2}X'D\mathbf{1}\mathbf{1}'DX .$$

$\widehat{\Sigma}$ is naturally the covariance matrix of our data (we assume that we observe the $\widetilde{X}_i$'s). Without loss of generality, we can assume that $\mu = 0$ and do so from now on in this discussion. Let us call $\hat{\mu} = X'D\mathbf{1}/n$, the mean of the vectors $R_iX_i$. Suppose we are interested in

$$\hat{\mu}_{\widetilde{X}}'(\widehat{\Sigma} + A)^{-1}\hat{\mu}_{\widetilde{X}} = (\mu + \hat{\mu})'(\widehat{\Sigma} + A)^{-1}(\mu + \hat{\mu}) .$$

These quantities occur naturally in various optimization problems, as well as in theoretical investigations of classification problems. Calling as before

$$M = X'D^2X/n + A , \quad\text{we see that}\quad \widehat{\Sigma} + A = M - \hat{\mu}\hat{\mu}' ,$$

and hence, using the rank-1 update formula,

$$\hat{\mu}'(\widehat{\Sigma} + A)^{-1}\hat{\mu} = \frac{\hat{\mu}'M^{-1}\hat{\mu}}{1 - \hat{\mu}'M^{-1}\hat{\mu}} .$$

Spelling out $M$ and $\hat{\mu}$, we see that

$$\hat{\mu}'M^{-1}\hat{\mu} = \alpha'\frac{DX}{\sqrt{n}}(X'D^2X/n + A)^{-1}\frac{X'D}{\sqrt{n}}\alpha ,$$

with $\alpha = \mathbf{1}/\sqrt{n}$. Hence our motivation for understanding these problems.
Naturally, we will also be interested in 



1 - Ji'M-^fi ■ 

and 



1 — n'M ^ji 



• Lindeberg Approach

We are now interested in

$$g(\alpha; X) = \alpha'\frac{DX}{\sqrt{n}}(X'D^2X/n + A)^{-1}\frac{X'D}{\sqrt{n}}\alpha .$$

The entries of $D$ are assumed to be deterministic and non-negative at this point. It is clear that this can be done without loss of generality, since $(D\alpha)_i = d_{i,i}\alpha_i$ (so negative signs in $D$ could be handled by changing the corresponding signs in $\alpha$, which would not affect $\|\alpha\|$).
Let us observe that

$$|g(\alpha; X)| \le \|\alpha\|^2 . \quad (14)$$

Indeed, setting

$$M = X'D^2X/n + A , \quad (15)$$

we have, since $M \succeq X'D^2X/n$,

$$\frac{1}{n}DXM^{-1}X'D \preceq \mathrm{Id}_n ,$$

since $\mathrm{Id}_n$ is greater in the Loewner order than any projection matrix.

Theorem 3.3. Suppose $X_i$ are i.i.d. and $Y_i$ are i.i.d. and follow the assumptions mentioned above (see Subsection 3.2). Assume that $D$ is a deterministic diagonal matrix, whose diagonal entries are positive and denoted by $R_j$. We assume that $A$ is a positive definite matrix with $A \succeq t\,\mathrm{Id}_p$, for some $t > 0$. Let us call, for a deterministic vector $\alpha$ with $\|\alpha\| = 1$ (without loss of generality),

$$g(\alpha; X) = \alpha'\frac{DX}{\sqrt{n}}(X'D^2X/n + A)^{-1}\frac{X'D}{\sqrt{n}}\alpha .$$

Then

$$|E(g(\alpha; X) - g(\alpha; Y))| \le \sum_{i=1}^n U_i(X_i; R_i^2; \alpha_i) + U_i(Y_i; R_i^2; \alpha_i) , \quad (16)$$

where the $U_i(X_i; R_i^2, \alpha_i)$ are deterministic quantities depending only on the distribution of $X_i$. We have, for a numerical constant $K$ that does not depend on the distribution of $X_i$ and $Y_i$, and not on $n$ or $p$ either,

$$\sum_{i=1}^n U_i(X_i; R_i^2, \alpha_i) \le K\sum_{i=1}^n\left(\frac{R_i^2}{\sqrt{n}\,t}\sqrt{b_{Q_2}(2;X_i)/n} \wedge 1\right)\left(\alpha_i^2 + \frac{R_i^2}{nt}\sqrt{b_L(4;X_i)}\right) .$$

Once again, when the $R_i$'s are random (but independent of $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$), it is clear that under minimal assumptions on the existence of moments for $R_i$, the right hand side will converge to 0. Suppose for the moment that $b_{Q_2}(2;X_i)/n$ and $b_L(4;X_i)$ are uniformly bounded and that the $R_i$ are random and uniformly square-integrable. Then we have



.1=1 



' A 1 1 an < 

n 



i=l 



and 



ME 



n In 



oil), 



so that the upper bound converges to zero in i?j-probability (and also in expectation when the expectation 
is taken over i?j's, Xj's and Yi's). Let us now prove this theorem. 



Proof. Let 



M := X'D^X/n + A and m := X' Da/ ./^ . 



Also, let Mi and rrii be the corresponding functionals for ^(j), where := X^^yjej-Xj. In other 
words, is obtained from X by setting the iih. row to zero. Clearly, we have 

M = Mi + \RjXiX[ and m = + -^a.RiXi . 

Note that is independent of Xi and so are Mj and rrii. After computing the rank-1 perturbation for 
{X'D'^X/n + ^)-\ we get that 



g{a-X) = ^a'{DXi^^+R,eiX[) 



.1 Rf Mr'xalM-' 



n 



1 , r,2 X^M-'X, 



{X[^^D + R,Xie'i)a . 



A straightforward calculation shows that, if gi{a;X) = j^a' DX(^i^M- ^X',--.Da, we have the key estimate 



g{a;X) = gi{a;X) + a. 



— -i=Ci) 



l + ^q^{X,) ^ 



(17) 



where 



Ci{Xi) = X[Mr^mi and q,{X.{) = X'^Mr^X, . 



We are now interested in g{a; X) — g{a; Y). Calling Zj = {Yi,Y2, . . . , Yj^i,Xj, . . . , we write as 
before 

n 

E{g{a;X)-g{a;Y)) = Y,^i9{a;Zj)-g{a;Zj+i)) . 

i=i 

It should be noted that the expansion we just got for g{a; X) as a function of Xj also holds if we replace 
X by Zj. 

With our decomposition (17) above, we immediately see that 



gia;Zi) - g{a;Zi+i] 



1 + ^q.iX,) 



1 



l + ^QiiYi) 



(a. - ^Q{Y,)f 



n 



where now Mi and rrii are computed from Zi instead of X. Note that Ej {Q(Xi)) = = Ej {Q{Yi)) and 
Ej (^(f{Xi)^ = Ej (^(f{Yi)^ because the two have the same covariance. 
Now let us call

$$\psi_i(X_i) = \left(\alpha_i - \frac{R_i}{\sqrt{n}}\,\zeta_i(X_i)\right)^2 ,$$

and let us define $q_i(X_i)$, $d_i$ and $\delta_i(X_i)$ as in the proof of Theorem 3.2. Then, using Equation (8), we have



1 + -^QiiXi) 1 + -^d 



n 



and therefore 



So we clearly see that 



n 



Bi ig{a; Zi) - g{a; Z,+i)) = - ^i(y,)<5i(l^,)) 

n 



Recall that we have shown earlier that 



E.(5.(X0^)<^«a^and^' 



and ^\6iiXi)\ < 1 . 



Recall also that Q = X'-M- mj. It is clear from (14) that 

\\Mr^m^\\ < \\a\\/Vt = l/Vi 

Hence, 

Ei(\aXit) <t~^/%L{k;X,) . 
Using Holder's inequality, we therefore see that 

Ei {\UX^mX,)\) < j^bQ,i2;X,)^af + ^bL{A-X,)/t^ < ^^6q,(2;X,) (^a? + M^5^(4; x,)/*^^ . 
By Equation (12), we also have 

^R^\MXi)Si{Xi)\ < \Mx^)\, 

whence 

E, {^Rj\UXMXi)\) < 2 (^a^ + ^B, {Q{X,)f^ < 2 (^a? + S.bL{2;X,)/t^ < 2 (^a^ + ^ ^^^(4; X,) , 

since &l(2; Xj) < ^6^,(4; Xj) by the Cauchy-Schwarz inequality. Since similar estimates hold for ipi(Yi)6i{Yi), 
it finally follows that 

\B{gia;X)-g{a;Y))\<Kf;^ (-^JbQ,i2;X,)/n Al) + ^^M^)) 

i=l 



+ 



n 



□ 



• Efron-Stein aspects We now turn to the Efron-Stein aspects of the problem, namely we show that our statistic has small variance.

Theorem 3.4. Suppose $X_i$ are i.i.d. and $Y_i$ are i.i.d. and follow the assumptions mentioned above. Assume that $D$ is a deterministic diagonal matrix, whose diagonal entries are positive and denoted by $R_j$. We assume that $A$ is a positive definite matrix with $A \succeq t\,\mathrm{Id}_p$, for some $t > 0$. Let us call, for a deterministic vector $\alpha$ with $\|\alpha\| = 1$ (without loss of generality),



DX 

g{a- X) = a—^{X'D'^X/n + A) 



^iX'D 



n 



a 



n 



Then we have, for a certain constant K, 



var {g{a; X)) <kY^ 



i=l 



RfbQ,{2;X,) , R: 



n nt^ 



+ -|6l(4;X,)-2 +a: 



1 , 2^2f,i(2;X, 



t' 



n t 



A 1

Before we turn to the proof, let us show that when the $R_i$'s are independent and have two moments, the upper bound converges to 0 in ($R_i$-)probability, when $b_L(4;X_i)$ and $b_{Q_2}(2;X_i)/n$ remain bounded as $n$ grows. Using the Marcinkiewicz-Zygmund strong law of large numbers, we know that

n ^4 

> — )■ in probability. 

i=l 

Now, suppose for the moment that 6q2(2; Xi)/n and 6l(4; Xi) are uniformly bounded and that the Ri are 
random. Since 

E {^afRf A 1) = E [i^atRf A l)l{afR^/n>i}) + E {ik^^fRf A l)l|„4^4/,,<i}) 
< P {a^R^/V^i > 1) + E {^a^R^) < 2a^B {R^) 

and Y^=\ = 1, we see that 

R"^ 

Eaf — - — > in Ri — probability . 
n 

Proof. A little bit of care is needed to handle the situation where ||a||4 is not small - otherwise the result 
could be obtained in a slightly easier fashion with slightly coarser bounds. Recall that 

g{a; X) - gi{a; X) = , 

1 + Rfqi(Xi)/n 

and therefore 

g{a;X) - gi{a;X) = af(l ^) + L_(i?2/^^2 _ 2aiRi/V^Q) ■ 

1 + IT^i 1 + 

Thus, if we set T := g{a; X) and Tj := gi{a; X) — af{l K — ), which does not depend on X^, we have 

IH — i-di 

n ^ 

T-T,= aj^ J'^^"^ + -^^{R^/nC! - 2a,R^/^Q . 

So, using the bounds used in the proof of the previous theorem, 

(IT - < K (af §6q,(2; + §5.(4; + ^^^) " 

Using the fact that < T = g{a\ X) < 1 and < gi{a; X) < 1, we also have \T — Ti\ < 1 + af, and hence 

Bi{\T-T,\^)<2B,{\g{a;X)-gi{a;X)\^ + aj) <4. 
Thus, the Efron-Stein inequality gives us 



var 



□ 



• Gaussian computations To understand the form we care about, it is now sufficient to compute its mean in a simple case. We naturally turn to the Gaussian case for this final task.

We now compute $E(g(\alpha;X))$ when the $X_i$'s are independent with (mean 0) normal distribution and possibly different covariance. Let us call

$$P_R = \frac{1}{n}DXM^{-1}X'D ,$$

with $M = X'D^2X/n + A$. $P_R$ is an $n \times n$ matrix. We have the following result.

Lemma 3.2. Suppose that the $X_i$ are independent normally distributed random variables, with mean 0 and covariance $\Sigma_i$. Then $E(P_R)$ is diagonal and

$$E(g(\alpha;X)) = \sum_{i=1}^n \alpha_i^2\,E(P_R(i,i)) ,$$

where

$$P_R(i,i) = 1 - \frac{1}{1 + \frac{R_i^2}{n}X_i'M_i^{-1}X_i} ,$$

and $M_i = \frac{1}{n}\sum_{j \ne i}R_j^2X_jX_j' + A$.

A particularly interesting case is that where the $X_i$ are exchangeable (so for instance, we now allow the covariance to be random with a certain prior, and conditional on $\Sigma_i$, the $X_i$'s are $\mathcal{N}(0,\Sigma_i)$, the resulting random variables being exchangeable), and so are the $R_i^2$ (which are assumed independent of the $X_i$'s). Then we have (if $E(\cdot)$ is expectation with respect to all sources of randomness) $E(P_R(i,i)) = E(P_R(j,j))$, for all $(i,j)$. In this case, we also have

$$E\left(\alpha'P_R\beta\right) = \alpha'\beta\left(1 - E\left[\frac{1}{1 + \frac{R_i^2}{n}X_i'M_i^{-1}X_i}\right]\right) .$$

Therefore, if $\alpha'\beta = 0$, we have

$$E\left(\alpha'DX(X'D^2X/n + A)^{-1}X'D\beta\right) = 0 .$$

Another very interesting case is the situation where i?j's are non-random (or random but independent 
of Xj's) and Xj's are i.i.d. Then, 

^ / 2 ^ 

B{g{a;X)) = \\a\\l-Y,^ 



0:7 



i=l 



l + ^XlMiiA)~^Xi 



Now, when ||Sj|| is not too large (i.e o{p^/'^~'^), i] > 0, it is easy to see (by concentration of Gaussian random 
variables, see Ledoux (2001) and El Karoui (2009a) for details of the application) that X'-Mi{A)~^Xi/p is 
concentrated around its mean, which is trace (^Mi[A)~^'Ei/pY When Sj = S, this quantity has a limit as 
n and p tend to 00 with p/n ^ p, and this limit is known (see e.g. Marcenko and Pastur (1967); Silverstein 
and Bai (1995)). As a matter of fact, then 

trace {Mi{Ay^Y,) = trace (x'^D^Xo/n + S-^/^^S^^/ 



where Xq are i.i.d AA(0,Idp) and Di = D — Rieie'^. Calling L this limit (which naturally depends on the 
distribution of -R's), we have 

n 2 

gia; X) ~ 1 - V . 

1=1 ' n t 

Proof. Notice that

$$P_R(i,j) = \frac{1}{n}R_iR_jX_i'(M'M/n + A)^{-1}X_j.$$

Now changing $X_i$ into $-X_i$ does not affect the term $M'M/n + A = \frac{1}{n}\sum_{i=1}^n R_i^2X_iX_i' + A$, but changes the sign
of $P_R(i,j)$. On the other hand, $(X_1,\ldots,X_{i-1},X_i,X_{i+1},\ldots,X_n) = (X_1,\ldots,X_{i-1},-X_i,X_{i+1},\ldots,X_n)$ in law. So
we conclude that

$$P_R(i,j) = -P_R(i,j) \text{ in law, when } i \neq j.$$

Now it is easy to check that in the positive semi-definite ordering, $\mathrm{Id} \succeq P_R$. So $|||P_R|||_2 \leq 1$. So in
particular, all of its entries are less than 1 in absolute value and therefore have moments.
So we have shown that when the $X_i$ are independent mean 0 Gaussian variables,

$$E(P_R(i,j)) = 0 \text{ if } i \neq j.$$

And we therefore have proved the lemma. (The description of the diagonal comes from using rank-1
update formulas.) □
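The two algebraic facts behind this proof can be checked numerically. The sketch below is only an illustration under our own assumptions (small arbitrary sizes, $A = t\,\mathrm{Id}$, and helper names that are not from the paper): it verifies that flipping the sign of one row flips the sign of the off-diagonal entry $P_R(i,j)$, and that the diagonal entry agrees with the rank-1 update expression $1 - 1/(1+\frac{R_i^2}{n}X_i'M_i^{-1}X_i)$.

```python
import numpy as np

def projection_like_matrix(X, R, A):
    """P_R = D X (X' D^2 X / n + A)^{-1} X' D / n, with D = diag(R)."""
    n = X.shape[0]
    DX = R[:, None] * X                      # row i of X scaled by R_i
    M = DX.T @ DX / n + A
    return DX @ np.linalg.solve(M, DX.T) / n

rng = np.random.default_rng(0)
n, p, t = 6, 4, 0.5                          # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
R = rng.uniform(0.5, 2.0, size=n)
A = t * np.eye(p)                            # simplest choice with A >= t * Id

P = projection_like_matrix(X, R, A)

# Flip the sign of row i: M is unchanged, P_R(i, j) changes sign for j != i.
i, j = 1, 3
X_flip = X.copy()
X_flip[i] *= -1.0
P_flip = projection_like_matrix(X_flip, R, A)
sign_flip_ok = np.isclose(P_flip[i, j], -P[i, j])

# Diagonal entry via the rank-1 update: M_i leaves out observation i.
DX_without_i = np.delete(R[:, None] * X, i, axis=0)
M_i = DX_without_i.T @ DX_without_i / n + A
q_i = X[i] @ np.linalg.solve(M_i, X[i])
diag_from_update = 1.0 - 1.0 / (1.0 + R[i] ** 2 * q_i / n)
diagonal_ok = np.isclose(P[i, i], diag_from_update)

# P_R is dominated by the identity in the psd ordering.
top_eigenvalue = np.linalg.eigvalsh((P + P.T) / 2).max()
```

Since the identities are exact, the same checks should pass for any seed and any positive semi-definite shift $A \succeq t\,\mathrm{Id}$.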
3.3.3 On $n^{-1/2}\alpha'DX(X'D^2X/n + A)^{-1}x$

These forms naturally occur in the study of quadratic forms involving both the sample mean and the 
sample covariance matrix as we explained at the beginning of Subsubsection 3.3.2, hence our interest in 
them. 

Therefore, for our applications, we also need results about the quantity

$$h(\alpha;X) := n^{-1/2}\alpha'DX(X'D^2X/n + A)^{-1}x,$$

where $\alpha$ and $x$ are deterministic vectors, whose norm we will generally assume (without loss of generality)
to be 1.

Note that if $M = X'D^2X/n + A$, $|||M^{-1/2}(X'D^2X/n)M^{-1/2}|||_2 \leq 1$, and hence,

$$|h(\alpha;X)| \leq \|\alpha\|\sqrt{x'M^{-1}x} \leq \|\alpha\|\sqrt{x'A^{-1}x} \leq 1/\sqrt{t}, \qquad (18)$$
which follows from the Cauchy-Schwarz inequality, and (14).
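As a sanity check on the chain of inequalities (18), here is a minimal numerical sketch. The dimensions, the way $A \succeq t\,\mathrm{Id}$ is generated, and all variable names are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, t = 8, 5, 0.3                          # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
R = rng.uniform(0.5, 2.0, size=n)

G = rng.standard_normal((p, p))
A = t * np.eye(p) + G @ G.T                  # guarantees A >= t * Id

alpha = rng.standard_normal(n)
alpha /= np.linalg.norm(alpha)               # unit-norm alpha
x = rng.standard_normal(p)
x /= np.linalg.norm(x)                       # unit-norm x

DX = R[:, None] * X
M = DX.T @ DX / n + A

h = alpha @ DX @ np.linalg.solve(M, x) / np.sqrt(n)
bound_M = np.sqrt(x @ np.linalg.solve(M, x))     # ||alpha|| sqrt(x' M^{-1} x)
bound_A = np.sqrt(x @ np.linalg.solve(A, x))     # ||alpha|| sqrt(x' A^{-1} x)

# |h| <= bound_M <= bound_A <= 1 / sqrt(t)
chain = [abs(h), bound_M, bound_A, 1.0 / np.sqrt(t)]
```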

Concentration Our first aim is to show that $h(\alpha;X)$ is also essentially deterministic.
Theorem 3.5. Under our usual assumptions (stated in Subsection 3.2), we have



E (/i(a; X) - E (/i(a; X))f < ^ 



t 



+ -^RtbL{4,X, 



t^ 



A {x'A'^x) 



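Proof. This is an application of the Efron-Stein inequality. Let $M$ and $M_i$ be defined as in the proof of
Theorem 3.1, and let $m := n^{-1/2}X'D\alpha$ and $m_i := n^{-1/2}X_{(i)}'D\alpha$, where $X_{(i)}$ is defined as in the proof of
Theorem 3.3. Using the rank-1 perturbation formula once more, we get

$$h(\alpha;X) := m'M^{-1}x = \left(m_i' + n^{-1/2}\alpha_iR_iX_i'\right)\left(M_i^{-1} - \frac{R_i^2}{n}\frac{M_i^{-1}X_iX_i'M_i^{-1}}{1+\frac{R_i^2}{n}X_i'M_i^{-1}X_i}\right)x.$$

A straightforward calculation shows that, if $h_i(\alpha;X) = m_i'M_i^{-1}x$ and $q_i(X_i) = X_i'M_i^{-1}X_i$,

$$h(\alpha;X) = h_i(\alpha;X) + \frac{\varphi_i(X_i)}{1+\frac{R_i^2}{n}q_i(X_i)}, \qquad (19)$$

where

$$\varphi_i(X_i) := n^{-1/2}\alpha_iR_iX_i'M_i^{-1}x - \frac{1}{n}R_i^2X_i'M_i^{-1}m_iX_i'M_i^{-1}x. \qquad (20)$$

Note that $h_i(\alpha;X)$ is independent of $X_i$ here. Thus, the Efron-Stein inequality yields

$$\mathrm{var}(h(\alpha;X)) \leq \sum_{i=1}^n \mathrm{var}(h(\alpha;X) - h_i(\alpha;X)) \leq \sum_{i=1}^n E\left[\frac{\varphi_i(X_i)^2}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2}\right].$$

Now, on the one hand, using (3), we have

$$E(X_i'M_i^{-1}x)^2 \leq b_L(2;X_i)\,x'A^{-1}x/t,$$

$$E(X_i'M_i^{-1}m_iX_i'M_i^{-1}x)^2 \leq b_L(4;X_i)\,x'A^{-1}x/t^2,$$

and therefore

$$E\left[\frac{\varphi_i(X_i)^2}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2}\right] \leq E(\varphi_i(X_i))^2 \leq 2\left(\frac{1}{n}\alpha_i^2R_i^2b_L(2;X_i)\frac{x'A^{-1}x}{t} + \frac{1}{n^2}R_i^4b_L(4;X_i)\frac{x'A^{-1}x}{t^2}\right). \qquad (21)$$

On the other hand, it follows from (19) and (18) that

$$E\left[\frac{\varphi_i(X_i)^2}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2}\right] \leq 2\left(E(h(\alpha;X))^2 + E(h_i(\alpha;X))^2\right) \leq 4x'A^{-1}x.$$

The proof is completed by combining these estimates.

□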
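The rank-1 decomposition (19)-(20) used in the proof above is an exact algebraic identity, so it can be checked numerically. The following is a minimal sketch under our own assumptions (small arbitrary sizes, $A = t\,\mathrm{Id}$, illustrative variable names), where $X_{(i)}$ is taken to be $X$ with its $i$-th row set to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, t = 7, 4, 0.4                          # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
R = rng.uniform(0.5, 2.0, size=n)
A = t * np.eye(p)
alpha = rng.standard_normal(n)
x = rng.standard_normal(p)
i = 2                                        # index of the observation left out

# Full statistic h = m' M^{-1} x.
DX = R[:, None] * X
M = DX.T @ DX / n + A
m = DX.T @ alpha / np.sqrt(n)
h = m @ np.linalg.solve(M, x)

# Leave-one-out versions: X_(i) has its i-th row replaced by zero.
X_i0 = X.copy()
X_i0[i] = 0.0
DX_i0 = R[:, None] * X_i0
M_i = DX_i0.T @ DX_i0 / n + A
m_i = DX_i0.T @ alpha / np.sqrt(n)
h_i = m_i @ np.linalg.solve(M_i, x)

# Ingredients of (19)-(20).
Minv_Xi = np.linalg.solve(M_i, X[i])         # M_i^{-1} X_i
q_i = X[i] @ Minv_Xi
phi_i = (alpha[i] * R[i] * (Minv_Xi @ x) / np.sqrt(n)
         - R[i] ** 2 / n * (Minv_Xi @ m_i) * (Minv_Xi @ x))

h_from_decomposition = h_i + phi_i / (1.0 + R[i] ** 2 * q_i / n)
```

Because $h_i$ does not involve $X_i$, all of the dependence of $h$ on the $i$-th observation sits in the second term, which is what makes the Efron-Stein bound tractable.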
Lindeberg approach

Our next aim is to show that the limit of $h(\alpha;X)$ does not depend on the distribution of the $X_i$.
Theorem 3.6. Under our usual assumptions (stated in Subsection 3.2), we have



|E (/i(a;X) - h{a-Y))\ < K^Uj{Xj) + Uj{Yj) , with 



Uj{Xj) < K 



nt 



6q(2;X,) Al 



Proof. We use the notation from the proof of Theorem 3.2. Using the decomposition (19) with $X$ replaced
by $Z_j$, $Z_{j+1}$ and observing that $h_j(\alpha;Z_j) = h_j(\alpha;Z_{j+1})$, we get

n 

E (/i(q; X) - h{a; y)) = ^ E {h{a; Zj) - h{a; Zj+i)) 



i=l 



+ '-Rh^{X^) l + ^R%{Y,) 



where $\varphi_i(X_i)$ is defined as in (20), but with $X$ replaced by $Z_i$. Next, using (8), we have



^i{Xi 



ipi{Xi) (fiiYi) 



l + l^Ufq^iXi) l + j^Rfq,{Y) \l + ^Rfdi l + iHid.^ 



+ ^{ipi{Xi)5i{X,) - ipiiYi)5iiY,) 
n 



Since $X_i$ and $Y_i$ both have mean 0 and covariance $\Sigma_i$, it follows that



E,; 



'fi{Yi) 



l + l^R^q,{X,) l + ii?2^,(y,) 



E,; 



n 



^,iXi)5iiX,) - ^,iYi)6,iYi) 



Now, on the one hand, using Cauchy-Schwarz inequality as well as (10) and (21), we have 



E,i\^,{Xi)5i{X,)\) < (Ei{6i{X,)fEi{ip,{Xi)y 



1/2 



K 



t 

K 
1 



<^-JhQ{2-X,)[^a}RjhL{2,X, 



t 



+ ^RtbL{i,X, 



t2 



< ^V^q(2;X,) I ^\a,\R,y^bLi2,X,)\l + V^'l(4, X,)^^^ 



On the other hand, using (12), we get 

^RjEi{\^i{Xi)6i{Xi)\)<Bi{\ipi{X,)\) < j^\ai\RibL{l,Xi] 



^ ^ 1 d2 



t 



+ j-RtbL{2,X, 



x'A 



t2 



We now use that $b_L(k;X_i) \leq \sqrt{b_L(2k;X_i)}$. Combining these estimates, we get

R^ 



^RfB i\ip,iXMX,)\) < K ( ^-^JbQ{2- Xi) A 1 



nt 



\OLi 



R,y/bU2,Xi)\l^^^^ + lR^,^/hKxl 



t 



x'A-^x 



i2 



Since similar estimates hold for $\varphi_i(Y_i)\delta_i(Y_i)$, this completes the proof.

□

• Gaussian computations

Consider the case where the $X_i$ are independent normal random vectors with mean zero and covariance
$\Sigma_i$. Then we clearly have $X = -X$ in law and therefore

$$h(\alpha,X) = h(\alpha,-X) = -h(\alpha,X),$$

where the first equality holds in law. But this means that we must have $E(h(\alpha,X)) = 0$. (Recall that $|h(\alpha;X)| \leq 1/\sqrt{t}$ by Equation (18), so
the existence of the expectation is not a problem.)
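The oddness of $h$ in $X$, which drives this symmetry argument, is a deterministic fact and is easy to confirm numerically. This is only a small sketch with arbitrary sizes and a hypothetical helper name of our own.

```python
import numpy as np

def h_statistic(alpha, X, R, A, x):
    """h(alpha; X) = n^{-1/2} alpha' D X (X' D^2 X / n + A)^{-1} x, D = diag(R)."""
    n = X.shape[0]
    DX = R[:, None] * X
    M = DX.T @ DX / n + A                    # invariant under X -> -X
    return alpha @ DX @ np.linalg.solve(M, x) / np.sqrt(n)

rng = np.random.default_rng(3)
n, p, t = 9, 5, 0.5                          # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
R = rng.uniform(0.5, 2.0, size=n)
A = t * np.eye(p)
alpha = rng.standard_normal(n)
x = rng.standard_normal(p)

h_plus = h_statistic(alpha, X, R, A, x)
h_minus = h_statistic(alpha, -X, R, A, x)    # equals -h_plus exactly
```

The identity $h(\alpha,-X) = -h(\alpha,X)$ holds for every realization; the Gaussian assumption is only used to assert that $X$ and $-X$ have the same law.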
3.4 Forms in $M^{-1}\Sigma_\epsilon M^{-1}$

In a variety of situations, we will need to work with quantities of the type $x'M^{-1}\Sigma_\epsilon M^{-1}x$.

These quantities will occur when we study $\mathrm{var}(x'M^{-1}\epsilon)$ where $\epsilon$ has mean 0 and covariance $\Sigma_\epsilon$. So these
quantities will appear when we investigate the risk of various estimators (or asset allocations). This is why
we restrict ourselves to $\Sigma_\epsilon \succeq 0$, though our proofs would go through with minor adjustments if $\Sigma_\epsilon$ was
allowed to be more general.

We will also need to understand 

if we want to understand the risk properties of certain portfolio allocations. 

Hence our problem is the following: in all the forms where before $M^{-1}$ was involved, we now want to
work with $M^{-1}\Sigma_\epsilon M^{-1}$ instead. Our idea, somewhat similar to the one developed in El Karoui (2009c),
is the following: consider

$$M_u = X'D^2X/n + A + u\Sigma_\epsilon, \text{ and } M_0 = M.$$

We remark that

$$M^{-1}\Sigma_\epsilon M^{-1} = -\frac{\partial}{\partial u}M_u^{-1}\Big|_{u=0}.$$

Hence, at least formally, our previous proofs will go through; the only thing we have to do is replace
$A$ by $A + u\Sigma_\epsilon$ and take a derivative with respect to $u$ so we can get the decompositions that will help us
make our methods work.
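The derivative trick can be illustrated numerically: with $f(u) = x'M_u^{-1}x$, the quadratic form $x'M^{-1}\Sigma_\epsilon M^{-1}x$ should coincide with $-f'(0)$. The sketch below compares the exact form with a central finite difference; the sizes, the choice of $\Sigma_\epsilon$, and the step size are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, t = 10, 4, 0.5                         # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
R = rng.uniform(0.5, 2.0, size=n)
A = t * np.eye(p)
G = rng.standard_normal((p, p))
Sigma_eps = G @ G.T / p                      # an arbitrary psd Sigma_epsilon
x = rng.standard_normal(p)
x /= np.linalg.norm(x)

DX = R[:, None] * X
S = DX.T @ DX / n                            # X' D^2 X / n

def f(u):
    """f(u) = x' (X' D^2 X / n + A + u * Sigma_eps)^{-1} x."""
    return x @ np.linalg.solve(S + A + u * Sigma_eps, x)

# Exact quadratic form F = x' M^{-1} Sigma_eps M^{-1} x at u = 0.
Minv_x = np.linalg.solve(S + A, x)
F_exact = Minv_x @ Sigma_eps @ Minv_x

# Central finite difference approximation of -f'(0).
step = 1e-5
F_finite_difference = -(f(step) - f(-step)) / (2.0 * step)
```

A finite difference is used here purely as a check; in the proofs the derivative is taken analytically on the decompositions obtained for $A + u\Sigma_\epsilon$.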



3.4.1 Forms in $x'M^{-1}\Sigma_\epsilon M^{-1}x$

It is natural to study these forms in a variety of contexts, for instance when $x = \mu$. We have the
following theorem, which holds under what we now call our "usual assumptions", namely $A \succeq t\,\mathrm{Id}$, the $X_i$ are
i.i.d with mean 0 and covariance $\Sigma_i$, and so are the $Y_i$'s, though $X_i$ and $Y_i$ have different distributions.

Theorem 3.7. Let M = ^ Y17=i ^iWi + A and 



Let us call 
Then 

Also, 



F{X) = xM-^Y.,M~^x . 
b{A,^,) = \\\A^^''^^,A^^''^\\\2 

n 

var(F(X)) < K{x' A-^xfh{A,i:^fY^ 



i=l 



44&L(4;Xi) Al 



|E {F{X) - F{Y))\ < U^[Xi) + U,{Yi) 



Ui{X,) < Kb{A-T., 



x'A ^x 



^3/2 



Rf lbQ,{2;X, 



n 



A^bL{2;Xi 
n 



As is explained in the proof of the theorem, when the $R_i$'s are i.i.d and uniformly square integrable, the
upper bound goes to zero, provided $b_{Q_2}(2;X_i)/n$ and $b_L(4;X_i)$ remain bounded.

It should be noted that when the $X_i$ are i.i.d with covariance $\Sigma$, we have found in Heuristic 2.2 and its
proof a deterministic equivalent for $F(X)$. Naturally, our theorem shows that doing computations in the
Gaussian case is enough to understand $E(F(Y))$ for $Y$ with a variety of distributions; that is the essence
of Lindeberg-style results.
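Proof. Let us call

$$f(X) = f(X;A) = x'M^{-1}x = x'(X'D^2X/n + A)^{-1}x.$$

Call $f_i(X)$ the same quantity where $D_{i,i} = R_i$ is replaced by $D_{i,i} = 0$ (or equivalently $X_i$ is replaced by 0).
Our key estimate was

$$f(X) - f_i(X) = -\frac{R_i^2}{n}\frac{(x'M_i^{-1}X_i)^2}{1+\frac{R_i^2}{n}X_i'M_i^{-1}X_i}.$$

This equality is true if $A$ is replaced by $A + u\Sigma_\epsilon$. Now we can take the derivative of this expression with
respect to $u$. Call

$$F(X) = x'M^{-1}\Sigma_\epsilon M^{-1}x,$$
and $F_i$ the same quantity when $X_i$ is replaced by 0. We have

$$F(X) - F_i(X) = -\frac{\partial}{\partial u}\left[f(X;A+u\Sigma_\epsilon) - f_i(X;A+u\Sigma_\epsilon)\right]\Big|_{u=0}.$$

Recall the notations $q_i(X_i) = X_i'M_i^{-1}X_i$, $d_i = \mathrm{trace}(M_i^{-1}\Sigma_i)$. After taking the derivative, we get

$$-(F(X) - F_i(X)) = 2\frac{R_i^2}{n}\frac{x'M_i^{-1}X_i\;x'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i}{1+\frac{R_i^2}{n}q_i(X_i)} - \frac{R_i^4}{n^2}\frac{(x'M_i^{-1}X_i)^2\;X_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2}.$$

Let us call

$$\tilde q_i(X_i) = X_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i,$$
$$\tilde d_i(X_i) = \mathrm{trace}(\Sigma_iM_i^{-1}\Sigma_\epsilon M_i^{-1}),$$

$$E_1 = 2\frac{R_i^2}{n}\frac{x'M_i^{-1}X_i\;x'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i}{1+\frac{R_i^2}{n}q_i(X_i)},$$

$$E_2 = \frac{R_i^4}{n^2}\frac{(x'M_i^{-1}X_i)^2\,\tilde q_i(X_i)}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2}.$$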



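Control of $E_1$ By using the fact that $|v'M_i^{-1}X_i| \leq \sqrt{v'M_i^{-1}v}\sqrt{X_i'M_i^{-1}X_i}$, we see that

$$|E_1| \leq 2\sqrt{x'M_i^{-1}x}\sqrt{x'M_i^{-1/2}\left(M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}\right)^2M_i^{-1/2}x}.$$

On $M_i^{-1}\Sigma_\epsilon M_i^{-1}$ We first note that $M_i^{-1/2}(M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2})^2M_i^{-1/2} \preceq M_i^{-1}|||M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}|||_2^2$. Now
$|||M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}|||_2 = \lambda_{\max}(M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}) = \lambda_{\max}(\Sigma_\epsilon^{1/2}M_i^{-1}\Sigma_\epsilon^{1/2})$ by e.g. similarity. Now $\Sigma_\epsilon^{1/2}M_i^{-1}\Sigma_\epsilon^{1/2} \preceq \Sigma_\epsilon^{1/2}A^{-1}\Sigma_\epsilon^{1/2}$,
so

$$|||M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}|||_2 \leq b(A;\Sigma_\epsilon).$$

We therefore also have

$$|||M_i^{-1}\Sigma_\epsilon M_i^{-1}|||_2 \leq \frac{b(A;\Sigma_\epsilon)}{t}.$$

We will also repeatedly need to control $\|M_i^{-1}\Sigma_\epsilon M_i^{-1}v\|$ for a fixed vector $v$. Call $u = M_i^{-1}\Sigma_\epsilon M_i^{-1}v$.
Clearly

$$u'u = v'M_i^{-1/2}\left(M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}\right)M_i^{-1}\left(M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}\right)M_i^{-1/2}v.$$

Now, using our bounds on $|||M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}|||_2$, $|||M_i^{-1}|||_2 \leq 1/t$ and the fact that $|||\cdot|||_2$ is submultiplicative,
we have

$$|||(M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2})M_i^{-1}(M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2})|||_2 \leq \frac{b(A;\Sigma_\epsilon)^2}{t}.$$

So

$$u'u \leq \frac{b(A;\Sigma_\epsilon)^2}{t}v'M_i^{-1}v.$$

Finally, we conclude that

$$\|M_i^{-1}\Sigma_\epsilon M_i^{-1}v\| \leq \frac{b(A;\Sigma_\epsilon)}{\sqrt{t}}\|M_i^{-1/2}v\| \leq \frac{b(A;\Sigma_\epsilon)}{\sqrt{t}}\sqrt{v'A^{-1}v}.$$

Using the previous bounds, we clearly then have

$$|E_1| \leq 2\,b(A;\Sigma_\epsilon)\,x'A^{-1}x.$$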
On the other hand, using the Cauchy-Schwarz inequality, we see that 



Hence, 



bLi2k-Xi) . 



:bLi2k;Xi)Al 



Control of E2 Writing 



" (1 + f g.) ^ (1 + f g.) 



we remark that 

" " (1 + f 5.) 
Hence, we can conclude that 

We also have the inequalities 



< and |^2| < 



R^ {x'M-'Xif 



n 



Rf 



(1 + ir^^^ 



E2\ < x'A-'x 6(A,S,) . 



R, 



2k 



^^(\E2f) < ^-\bL{2k-X,){x'Mr^xfh{A-T.,Y < :\bL{2k; Xi){x' A-'x 



2k 



.,.,,b{A;K)' 



Therefore, 



E(|^2r) <K{x'A-^x)'' 



R 



2k 



fk^k 



bL{2k-Xi)M 



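• Efron-Stein aspects

Using the Efron-Stein inequality, we have

$$\mathrm{var}(F(X)) \leq \sum_{i=1}^n \mathrm{var}(F(X) - F_i(X)) \leq K(x'A^{-1}x)^2b(A,\Sigma_\epsilon)^2\sum_{i=1}^n\left(\frac{R_i^4}{n^2t^2}b_L(4;X_i) \wedge 1\right).$$

Hence, when the $R_i$'s have 2 moments, $\mathrm{var}(F(X)) \to 0$ in $R_i$-probability.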
• Lindeberg aspects

We go a bit fast here. $M_i$ is now computed from data $X_1,\ldots,X_{i-1},0,Y_{i+1},\ldots,Y_n$. Recall that

$$E_1(X_i) = 2\frac{R_i^2}{n}\frac{x'M_i^{-1}X_i\;x'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i}{1+\frac{R_i^2}{n}q_i(X_i)}.$$

Let us show that, when $Y_i$ and $X_i$ have the same covariance $\Sigma_i$ and mean 0, we can control

$$\sum_{i=1}^n E\left(E_1(X_i) - E_1(Y_i)\right).$$
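We call $N_i = x'M_i^{-1}X_i\;x'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i$ and (as before) $d_i = \mathrm{trace}(\Sigma_iM_i^{-1})$. We have

$$E_1(X_i) = 2\frac{R_i^2}{n}\frac{N_i(X_i)}{1+\frac{R_i^2}{n}d_i} + 2\frac{R_i^2}{n}N_i(X_i)\left[\frac{1}{1+\frac{R_i^2}{n}q_i(X_i)} - \frac{1}{1+\frac{R_i^2}{n}d_i}\right].$$

Note that $E_i(N_i(X_i)) = E_i(N_i(Y_i))$, so to control $E(E_1(X_i) - E_1(Y_i))$, we just need to understand the
second term, namely

$$\mathcal{R}_1(X_i) = 2\frac{R_i^4}{n^2}N_i(X_i)\frac{d_i - q_i(X_i)}{\left(1+\frac{R_i^2}{n}d_i\right)\left(1+\frac{R_i^2}{n}q_i(X_i)\right)}.$$

Our studies in Subsubsection 3.3 show that

$$\delta_i(X_i) = \frac{d_i - q_i(X_i)}{\left(1+\frac{R_i^2}{n}d_i\right)\left(1+\frac{R_i^2}{n}q_i(X_i)\right)}$$

is such that

$$\left|\frac{R_i^2}{n}\delta_i(X_i)\right| \leq 1.$$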

On the other hand, we have essentially given bounds earlier for Ej (lA^jl'^) (see the work on Ej (|£'j|'')), so 
we have 

E.(|^l(^^)|) <i^^6L(2;X,)^^^6(^;S,) . 



n 



t 



Furthermore, 



So 



r>4 

|^i(Xi)l < K^\Ni\\di-qi{Xi 



E. (|7^l(X,)|) < K-fx/E,(|d,-g,(X,)|2)^E, {N^) . 
Using our bounds on Ej (A^f) and those on Ej (|(ij — (7j(Xj)p), we get 



E, (|7^l(X,)|) < K^^^biA; S J ^^Q^p^ y5l(«) . 



We conclude that 



|E(Ki(Xi))|<K^^^K'4;S.) 



fi'^/^ V n r n 



and similarly for Ij. We have shown that 

|E (i?i(X,) - Si(y,))l < K^^-^b{A; S,) 



n'^/^ V n i n 



and we can therefore control 



Y,^{E,{X,)-E^iYi)) 



1=1 



• About $E_2$

We now turn to the $E_2$ part of the problem. The strategy is to replace

$\tilde q_i(X_i) = X_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i$ by the "equivalent" (and independent of $X_i$)
$\tilde d_i = \mathrm{trace}(\Sigma_iM_i^{-1}\Sigma_\epsilon M_i^{-1})$,

and similarly to replace $q_i(X_i) = X_i'M_i^{-1}X_i$ by $d_i = \mathrm{trace}(\Sigma_iM_i^{-1})$. Hence the first term is going to
have the same mean for both $X_i$ and $Y_i$ and we just have to work on the remainders. Let us call



Aj(Xj) 



{l + ^q,{Xi)f [l + ^d^Xi)) 



Rf 
and let us remark that, with the $\delta_i(X_i)$ notation we just recalled, we have

1 1 



n 



+ 



1 + ^q^{Xi) 1 + ^di{Xi) 



With this notation, we have 
E2{Xi 



^^^,(x,) + 7^2,l(x,)+7^2,2(x 



+ ^{x'M-'X,Y 



(1 + ^d,? 



Note that by construction $E_i(\mathcal{M}_i(X_i)) = E_i(\mathcal{M}_i(Y_i))$, so to bound $E(E_2(X_i) - E_2(Y_i))$, all we will have
to do is bound $E(|\mathcal{R}_{2,1}(X_i)|)$ and $E(|\mathcal{R}_{2,2}(X_i)|)$. Before we turn to this task, let us recall that

$$|E_2(X_i)| \leq \frac{R_i^2}{n}(x'M_i^{-1}X_i)^2\,b(A,\Sigma_\epsilon).$$

In other respects, if $A$ and $B$ are positive semi-definite (psd) matrices and $|||B|||_2 \leq C$, then $\mathrm{trace}(AB) \leq
C\,\mathrm{trace}(A)$ (because when $A$ and $B$ are psd, $A^{1/2}BA^{1/2} \preceq |||B|||_2A$). Therefore,



Hence, 



di < 6M;S,)dj and M^{Xi) < -!-(x' M-'XiV b(A,J:, 

n 



r>2 

|7^2,l(Xi) + n2,2{Xi)\ < K^{x'Mr^Xif b{A, S 
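The trace inequality for positive semi-definite matrices invoked above, $\mathrm{trace}(AB) \leq |||B|||_2\,\mathrm{trace}(A)$, can be illustrated with a short numerical sketch. The matrices below are arbitrary psd matrices generated for the example, and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 6                                        # illustrative dimension (assumption)

G1 = rng.standard_normal((p, p))
A_psd = G1 @ G1.T                            # arbitrary psd matrix A
G2 = rng.standard_normal((p, p))
B_psd = G2 @ G2.T                            # arbitrary psd matrix B

operator_norm_B = np.linalg.norm(B_psd, 2)   # |||B|||_2 (largest singular value)
trace_AB = np.trace(A_psd @ B_psd)
upper_bound = operator_norm_B * np.trace(A_psd)
```

In the proof this is applied with $A = M_i^{-1/2}\Sigma_iM_i^{-1/2}$ and $B = M_i^{-1/2}\Sigma_\epsilon M_i^{-1/2}$, which is how $b(A;\Sigma_\epsilon)$ enters the comparison between the two traces.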



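Let us now work more precisely on $\mathcal{R}_{2,1}(X_i)$ and $\mathcal{R}_{2,2}(X_i)$.
• On $\mathcal{R}_{2,1}(X_i)$.

Note that, using $|||M_i^{-1}\Sigma_\epsilon M_i^{-1}|||_2 \leq b(A;\Sigma_\epsilon)/t$, we have

$$E_i\left(|\tilde q_i(X_i) - \tilde d_i(X_i)|^2\right) \leq \frac{b^2(A;\Sigma_\epsilon)}{t^2}b_{Q_2}(2;X_i).$$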
Using the Cauchy-Schwarz inequality in connection with the previous remark, we get 

<^i.,-2MA;^e)lhQ,{2-Xi) 



^^{\n2,l{X,)\)<^{x'Mr^x) ^ 
Rf (x'A-^x 



n 



< 



3/2 ^2 



n 



• On 7^2,2(Xi). 

Let us first note that 



M|il<6(^;5],) and \A,\ < K^- 



\di - Qi 



di - qi\ 



R? 



Hence, 



\Aiqi{Xi)\ <Kb{A-Y. 



(1 + f ^0 



We can therefore conclude that 



|^2,2(X*)| <K^b{A-^^ 



(1 + ^d,) 
Using the Cauchy-Schwarz inequality we also get

Rf b{A;J:,)^,^^_,^lbQ,{2;Xi) 



B,{\n2,2{Xi)\)<K 



n3/2 t 



n 



Rf 



n 



We conclude that if f/^ = E (|7^2,l(^^) + 7^2,2(^^)l). 

|E {E2{Xi) - E2iYi))\ < U^{Xi) + Ui{Y{) 

where 



U^{X,) < Kb{A-T.,) 



x'A~'^x 
t 



3/2 



n 



Rf /6q,(2;X,) 



n 



A^6l(2;X,) . 
n 



Finally, putting everything together we have shown that 

n 

|E {F{X) - F{Y))\ < KY,Ui{Xi) + Ui{Yi) . 



i=l 



We conclude that when the $R_i$ are independent and have 2 moments, the upper bound goes to zero in $R_i$-probability,
provided $b_L(4;X_i)$ and $b_{Q_2}(2;X_i)/n$ remain uniformly bounded (we have already analyzed
similar series previously). □

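3.4.2 Forms in $\alpha'\frac{DX}{\sqrt{n}}M^{-1}\Sigma_\epsilon M^{-1}\frac{X'D}{\sqrt{n}}\alpha$
In the analysis of quantities of the type

$$\hat\mu'(\hat\Sigma + A)^{-1}\Sigma_\epsilon(\hat\Sigma + A)^{-1}\hat\mu$$
we will naturally have to understand quantities of the type, if $M = X'D^2X/n + A$,

$$G(\alpha;X) = \alpha'\frac{DX}{\sqrt{n}}M^{-1}\Sigma_\epsilon M^{-1}\frac{X'D}{\sqrt{n}}\alpha.$$

We work under our usual assumptions, and in particular $A \succeq t\,\mathrm{Id}$.
We have the following theorem.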

Theorem 3.8. Under the usual assumptions of this paper, when \\a\\ = 1, we have, for K a constant. 



var(G(a;X)) < ^l^j, with 



+ 



Vi<Kb^{A;T.,){a 



i=l 
)4 



Rt bQ,{2-Xi) ^ 

2 ^2 



af^bQ,{2;X,)^ + ^bU^;X,)^ + a ^ 
1 



1 ^ 2i?2 6i(2;X,) 



R: 



A 1 



2-^j 



1 



-LbL{4;X,)-^ + af^bL{2;Xi)- 



n 



Furthermore, 



BiG{a■X)-G{a■,Y))\<Y,UiAX^) + Ui,2{X^) + Ui,l{Yi) + Ui,2(Y) , where 
b{A;^,) ( i?2 



1=1 



U^,iiXi) < K 



.(2;X,) 



nt 



A 1 



aj R 



Ri 



1 



n n yjt 



Ui,2iX,) < KbiA-J:, 



2 , R^bU2;Xi 



a,- + 



n 



A1 



'bQ,{2;Xi 



n 



1 / -Rf 1 



\R?. 
It is shown in the course of the proof that the upper bounds go to zero in probability when the $R_i$'s are
i.i.d and uniformly square integrable and $b_L(4;X_i)$ as well as $\sqrt{b_{Q_2}(2;X_i)/n}$ remain uniformly bounded.

We note that in the Gaussian case (i.e. the $X_i$ are $\mathcal{N}(0,\Sigma_i)$), by the symmetry trick we have now used
several times, it is clear that the off-diagonal elements of the matrix

$$\frac{DX}{\sqrt{n}}M^{-1}\Sigma_\epsilon M^{-1}\frac{X'D}{\sqrt{n}}$$

have mean 0. Hence, to understand $E(G(\alpha;X))$, all that is needed is to understand the diagonal entries
of

$$\frac{DX}{\sqrt{n}}M^{-1}\Sigma_\epsilon M^{-1}\frac{X'D}{\sqrt{n}}.$$

If we further assume that the $X_i$ have the same $\Sigma$, computations similar to the ones done in Subsubsection
3.3.2 (also using our derivative trick) and fairly standard random matrix results yield a reasonably simple
expression. In the interest of space, and since this is a very simple problem, we do not state the
deterministic equivalent in more detail.



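Proof. We use the same trick as in the previous subsection, namely calling

$$g(\alpha;X;A+u\Sigma_\epsilon) = \alpha'\frac{DX}{\sqrt{n}}\left(X'D^2X/n + A + u\Sigma_\epsilon\right)^{-1}\frac{X'D}{\sqrt{n}}\alpha,$$

we see that

$$G(\alpha;X) = -\frac{\partial}{\partial u}g(\alpha;X;A+u\Sigma_\epsilon)\Big|_{u=0}.$$

Hence we can use the refined understanding of $g$ we have developed earlier to study $G$.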
In particular, the key equation in the study of g was 



g{a; X;A + nS^) — gi{a; X;A + nS^) = 



2 (a, - RiCiiXi)/^y 



l + ^q^{Xi) 



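with, if $D_i = D - R_ie_ie_i'$ (i.e. $D_i$ is $D$ where we replace the $i$-th diagonal entry by a 0),

$$q_i(X_i;u) = X_i'\left[M_i(A+u\Sigma_\epsilon)\right]^{-1}X_i,$$

$$\zeta_i(X_i;u) = X_i'\left[M_i(A+u\Sigma_\epsilon)\right]^{-1}m_i, \qquad m_i = \frac{X'D_i\alpha}{\sqrt{n}}.$$

Hence, if $M_i = X'D_i^2X/n + A$,

$$\frac{\partial}{\partial u}q_i(X_i)\Big|_{u=0} = -X_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i \triangleq -\tilde q_i(X_i),$$
$$\frac{\partial}{\partial u}\zeta_i(X_i)\Big|_{u=0} = -X_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}m_i \triangleq -\tilde\zeta_i(X_i).$$

Hence, if $G_i(\alpha;X)$ is the same statistic as $G(\alpha;X)$ where $X_i$ is replaced by 0 (and hence it does not depend
on $X_i$), we have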



G[a\X) - Gi[a\X) = 2— qi 

^ 1 + ^%(X,) 



[oLi - RiC,i{Xi\u) I ^) \ 
, l^^^cli[X,■u) ) 



In preparation for Lindeberg-style work below, we note that if $Y_i$ and $X_i$ have mean 0 and the same
covariance $\Sigma_i$,



E, (0(Xi)) = Ei (0(^0) 

E, ( Ci(Xi)c,(Xi)) = Ei Ui[y^)Uy'^ 



E, (Cf(Xi))=Ei {Q{Y,)) 



E, 



(c^(x,)) =E, (c?(y.) 
Recall also that $\|M_i^{-1/2}m_i\| \leq 1$, $\|M_i^{-1}m_i\| \leq t^{-1/2}$, so $\|M_i^{-1}\Sigma_\epsilon M_i^{-1}m_i\| \leq b(A;\Sigma_\epsilon)/\sqrt{t}$. We therefore
have the estimates

E (|C.(X,)|^-) < bL{k;X,) and E (\Q{Xit) < bL{k; X,)t-'^/^ . 

Let us call, if M^i) = " ^G(^i))', 

i2i Ci(X,)(i?i/V^Ci-«*) 



E2{X., 

Clearly, 



1 + ^qi{Xi) 1 + f 



G(a;X) -Gi(a;X) =2Si(Xi) -^2(Xi) . 

• Efron-Stein aspects

The aim here is to find $Z_{i,1}$, independent of $X_i$, such that we can control $E(|E_1(X_i) - Z_{i,1}|^2)$, and similarly
for $E_2(X_i)$, we will try to find a $Z_{i,2}$ such that we control $E(|E_2(X_i) - Z_{i,2}|^2)$. This will give us control of
$\mathrm{var}(G(\alpha;X))$.

1) Controlling Ei(Xi) Let us call 

rp _ Rj Q;^C^(^^) 

V^l + ^q,{X,) 

Clearly, T^. < afRf/nCf{Xi) and therefore 

This term will not cause problem in our analysis as XliLi ij^'^^ ^^^^ clearly go to zero when i?j's have 
two moments and b{A; S^) as well as bi[k]Xi) remain bounded. (Recall that ||q|| = 1.) 
Let us call 

CiCi 



T2,i — ^2 • 

^ 1 + ^q.iXi) 

Clearly, 

In particular, 

E(|r2,n<4&L(4;X.)-^'^^'^^) 



?4 /^2 



(Since, when i?j's are i.i.d and have two moments, ^iRf/n a.s, this terms is again not going to 
create any problems when we try to control the variance of G.) 
So we have shown that 



E, (EiiXif) < 2 



^bL{4;X,)Uaj^bL{2;X.)\ ^'^^'^^^ 
in 



t 

2) Controlling E2(Xi) Recall the decomposition from the proof of Theorem 3.4 

M^i) af _^.2Rl qi{Xi)-di 1 .p2/ ^2 9^R//;:A^ 

^2 ^2 — - «j ^2 ^2 \ ^2— - 2,aiKi/yJnQi) 

1 + ?^q,{Xi) l + ^d, ^ (1 + + ^d,) 1 + ^g, 



Let us call 
Let us note that since
3.4, 



< Sf), we will have, using the work we did in the proof of Theorem 



i2 



1 R 



al^bQ^(2-Xi)-. + ^6L(4;Xi)-2 +a. 



1 , ^RfbL{2;Xi 



As we saw then, these terms will not cause any problem in our eventual control of the variance. 
So we just need to focus on understanding 

Rynq,{X,) af 



7^2,^(^^) 



1 + §q^{X,) 1 + ^di 



Now recall that we called $\frac{R_i^2}{n}\delta_i(X_i) = 1/(1+\frac{R_i^2}{n}q_i(X_i)) - 1/(1+\frac{R_i^2}{n}d_i)$; with this notation, we have



RVnqjX,) _ Rynd, WQ^d^ R] ~ Rj,,^. 

^2 — ^2 ^ ^2 \ "i Oi[Jii) . 

l + ^q^{X,) l + ^d, ^l + ^qi{X,) ^ ^ 



Hence, 



A 1 



n2,iiXi) 



Rj/ndi of 
1 + 1 + ^di 1 + ^di 



Since di{Xi) < b{A; Tie)di, we have 



R^ dj 



We have seen that 



Ei (\qi{Xi) - di{X^)?) < \\\Mr^^,M-'\\\lbQ,{2;X,) < ^^i^^bQ,{2; X, 



Furthermore, we saw previously that 



E. {6hXi)) <:^bQ,{2;X, 



So we have 



E,; 



^2,^(^^) 



R?di 



r2 p2 



< Ka 



I 9 



t2 



On the other hand. 

So we conclude that 
Ei 

Hence, 

E, 



n2,^{X^) 



r2 r2 
1 + ^di 1 + 



< Kb{A]i:,)a. 



a,- 



o2 p2 

1 + ^di 1 + i^di 



< Kay{A;^ 



Rf bQ,{2;Xi 



A 1 



a. 



l + $dil + ^di 



< Kb^{A-ll,) { a 



Rf bQ,{2;X,) ^ ; 
t2 



n 



+ 



ai-2bQ,{2;Xi)-^ + ^6l(4; X^)^ + ^ 



A 1 
So if



Z, = G,{a;X) 



Rid, 



a- 



1 + 1 + ^di 



and Z = G{a;X)^ we have shown that 



Bi{\Z-Zi\^) <Kb\A;^,)\at 



A 1 



+ 



,4^ 



1 i?: 



1 , 2^2^l(2;X,) 



'i2 



A 1 



p4 2 R2 

+ J-6i(4;X0^ + a?^6L(2;Xi)- 



n 



This bound is sufficient to allow us to apply the Efron-Stein inequality, as we saw earlier: as soon as the
$R_i$'s have two moments and are i.i.d, the (sum over $i$ of the) upper bound goes to zero.

• Lindeberg aspects
1) Controlling $E(E_1(X_i) - E_1(Y_i))$

To alleviate the notation, we make a slight abuse of notation and change the meaning of $M_i$ compared to
what was used in the previous part of the proof: because we are now in the Lindeberg setting, the matrix
$M_i$ is (as usual) computed by using $(X_1,\ldots,X_{i-1},0,Y_{i+1},\ldots,Y_n)$, but it is still independent of $X_i$ and $Y_i$.
Recall that

R, Q{Xi){Ri/V^Ci-ai) A N,{X,) 



EiiXi 



Calling as usual 



> l + f%(X.) 

1 



1 + ^qi{X, 



o2 

^5i{X, 
n 



1 + ^qi{Xi) 1 + f ' 



we have 



Ei{Xi) = ^'^^ + ^N,{X,)6,{Xi) . 



1 + -^di 



n 



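Of course $E(N_i(X_i)) = E(N_i(Y_i))$ when $X_i$ and $Y_i$ have mean 0 and the same covariance, $\Sigma_i$. Therefore,
since $d_i = \mathrm{trace}(\Sigma_iM_i^{-1}) = E_i(X_i'M_i^{-1}X_i) = E_i(Y_i'M_i^{-1}Y_i)$, we have

$$E\left(\frac{N_i(X_i)}{1+\frac{R_i^2}{n}d_i}\right) = E\left(\frac{N_i(Y_i)}{1+\frac{R_i^2}{n}d_i}\right).$$

The only question now is to try to control the remainder term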

7^l(x,) = ^N,{x,)6.{x,) = Ei(x,):^i^i-^i^M 

"l + ^d,(X,) 
We have, after using Cauchy-Schwarz and our usual bounds, 

R^ lbQ,{2;X.. 



Bi{\ni{Xi)\) < 



nt 



Using the results we got earlier on -\/Ej (£^^(Xj)), we have 



Ri 



On the other hand, 



\ni{Xi)\ < \Ni{Xi 



Vt 
From its definition, we see that
Hence, 



and 



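At this point, we would like to show that the control we have is sufficient for the Lindeberg method to
work when the $R_i$'s are i.i.d and have two moments. For this, it is sufficient to show that

$$E\left(\sum_{i=1}^n|\alpha_i|\left(R_i/\sqrt{n} \wedge R_i^3/n\right)\right) \to 0.$$

We will simply show that $E\left(R_i/\sqrt{n} \wedge R_i^3/n\right) = o(n^{-1/2})$. We note that, since $R_i \geq 0$,

$$E\left(R_i \wedge n^{-1/2}R_i^3\right) = E\left(R_i1_{R_i > n^{1/4}}\right) + E\left(R_i^3n^{-1/2}1_{R_i \leq n^{1/4}}\right) \leq E\left(R_i1_{R_i > n^{1/4}}\right) + n^{-1/4}E\left(R_i^2\right).$$
Since $R_i$ has two moments (and hence one), the monotone convergence theorem guarantees that

$$E\left(R_i \wedge n^{-1/2}R_i^3\right) = o(1).$$
We now remark that since $\|\alpha\| = 1$, $\|\alpha\|_1 \leq \sqrt{n}$. Therefore,

$$E\left(\sum_{i=1}^n|\alpha_i|\left(R_i/\sqrt{n} \wedge R_i^3/n\right)\right) = \|\alpha\|_1E\left(R_i/\sqrt{n} \wedge R_i^3/n\right) = o(\|\alpha\|_1/n^{1/2}) = o(1).$$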
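The truncation step above rests on a pointwise inequality: for $r \geq 0$, $r \wedge n^{-1/2}r^3 = r1_{r > n^{1/4}} + n^{-1/2}r^31_{r \leq n^{1/4}} \leq r1_{r > n^{1/4}} + n^{-1/4}r^2$. The sketch below checks it on a grid, together with $\|\alpha\|_1 \leq \sqrt{n}$ for a unit vector; the grid, the value of $n$, and the variable names are illustrative assumptions of ours.

```python
import numpy as np

n = 10_000                                   # illustrative sample size (assumption)
r = np.linspace(0.0, 50.0, 2001)             # grid of nonnegative values for R_i
threshold = n ** 0.25

minimum = np.minimum(r, r ** 3 / np.sqrt(n))             # r ∧ n^{-1/2} r^3
split = (r * (r > threshold)
         + r ** 3 / np.sqrt(n) * (r <= threshold))       # exact splitting
upper = r * (r > threshold) + r ** 2 * n ** (-0.25)      # bound used in the text

# l1 norm of a unit-norm vector is at most sqrt(n) (Cauchy-Schwarz).
rng = np.random.default_rng(7)
alpha = rng.standard_normal(n)
alpha /= np.linalg.norm(alpha)
l1_norm = np.abs(alpha).sum()
```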

2) Controlling E (E2(Xi) - EalY;)) 
Recall the notation 



Let us write 



By definition. 



Therefore, 



n n 



(1 + 



(1 + ^ ^ (1 + ^di^ 

= M2{Xi) + 7^2,l(X0 + 7^2,2(X^) . 
It is clear that when $X_i$ and $Y_i$ have the same covariance $\Sigma_i$ and mean 0, $E_i(\mathcal{M}_2(X_i)) = E_i(\mathcal{M}_2(Y_i))$.
Hence, in controlling $E(E_2(X_i) - E_2(Y_i))$, all we will have to do is control

$$E_i\left(|\mathcal{R}_{2,1}(X_i) + \mathcal{R}_{2,2}(X_i)|\right).$$



a) Controlling E (|7^2,l(Xi)|) 

We have, using the Cauchy-Schwarz inequality: 



Therefore, 



and 



Ei(Ct\di-qi{Xi)\] < 



E, 



{\CiH-qi{Xi)\ 



< 



b{A; 
t3/2 



V^l(4;X0^6q,(2;X0 , 



■y^bL{2;Xi)JbQ,{2;X,) . 



Ri 



■E. 



. (1 + f y' - 



a?; 



We conclude that 



V6L(4;X,)V6Q2(2;^.)/n 



(1 + ^d.)^ 



< 



n 



^bL{2-Xi)JhQ,{2-Xi)/n 



E^(|7^2,l(Xi)|) <K 



■^JbQ,{2;Xi)/n 



Rf 1 
n3/2 ii/2 



b) Controlling E(|7^2,2(Xi) I) 
Since 



^qi{Xi)Ai{X,) 
n 



„ f?2 ' 



we have 



E,; 



n n 



1 + 



Hence, 



E,; 



/\i{Xi)iJi{Xi)^qi[X,) 
n 



c) Controlling |Ei (S2(^^) - ^2(1*)) | 

We note that |-E2(Xj)| < il;i{Xi)h{A;T.^) and therefore 



E. {\E2{X,)\) < Kb{A;^,){aj + ^^^^'^^^ 



n t 



We can finally conclude that 



\Bi {E2{Xi) - E2{Y,)) I < «>i(X,) + $i(y,) 



where 



^>i(X,) < Kh{A-T., 



^ , R'^bL{2;Xi 

1 CV,' ~r 



n t 



A 



bQ2{2;Xi 



n 



Rj (af ^ Rj 1 rr—-^ 
+ — TaV^L 4;X, 
'n \ t n t"^ 



+ - 



1 / 1 



v/6l(4;X,) + ^1^^6l(2;X, 



^3/2 \ ^3/2 ^1/2 



n 
This expression is somewhat unseemly; however, assuming that $b_L$, $b_{Q_2}/n$ and $b(A;\Sigma_\epsilon)$ stay bounded we
see that it is of the form

cI>,(X.) <kI^{^ + a\ + ^)) A + ^) < K (af + A l) . 

\ A/n n X n / n \ n I \ Jn I 



We have already seen how to control this expression when the $R_i$ are i.i.d and uniformly square integrable
in the proof of Theorem 3.5. So we conclude that when this is the case $\sum_{i=1}^n\Phi_i(X_i)$ will tend to 0 (for
instance in $R_i$-probability).

□ 



3.4.3 Forms in $\frac{1}{\sqrt{n}}\alpha'DXM^{-1}\Sigma_\epsilon M^{-1}x$

The third and last situation we need to consider are forms of the type

$$H(\alpha;X) = \frac{1}{\sqrt{n}}\alpha'DXM^{-1}\Sigma_\epsilon M^{-1}x,$$

where as usual

$$M = \frac{1}{n}X'D^2X + A.$$

We work under our usual assumptions, and in particular $A \succeq t\,\mathrm{Id}$.
We have the following theorem.

Theorem 3.9. Under the usual assumptions of this paper (see Subsection 3.2), we have, for K a constant, 



var {H{a] X)) < ^ Vi, with 

i=l 

x'A^^x 



Vi<Kb'{A;^,y- 



t 



R 



'albL{2;Xi) + ^-^bL{A-X, 



Rf 



n 



|E {H{a; X) - H{a- Y))\ <Y, Ur{Xi) + U^{Yi) , where 



i=l 



U,{Xi) < Kb{A-T,, 



x'A ^x 



t 



ai\R,x/bL{2;Xi) + 
n n 



Rf bL{4;X,) 



t 



t 



n 



The proof of the theorem uses the same ideas as before and will rely on the work of Subsubsection 
3.3.3. 

We also note that by the same symmetry arguments as before, in the Gaussian case, we trivially have
$E(H(\alpha;X)) = 0$.



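Proof. Naturally, $H(\alpha;X)$ is closely related to

$$h(\alpha;X) = \frac{1}{\sqrt{n}}\alpha'DXM^{-1}x,$$

which we studied earlier. Recall that we got the key decomposition

$$h(\alpha;X;A) = h(\alpha;X) = h_i(\alpha;X) + \frac{\varphi_i(X_i)}{1+\frac{R_i^2}{n}q_i(X_i)},$$

where $h_i$ did not involve $X_i$ and

$$\varphi_i(X_i) = \frac{1}{\sqrt{n}}\alpha_iR_iX_i'M_i^{-1}x - \frac{R_i^2}{n}X_i'M_i^{-1}m_iX_i'M_i^{-1}x,$$

$$q_i(X_i) = X_i'M_i^{-1}X_i.$$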
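As before, we can deduce $H$ from $h(\alpha;X;A+u\Sigma_\epsilon)$ by taking the derivative of the latter with respect
to $u$ and appropriately modifying the sign.
We call

$$H_i = -\frac{\partial h_i(\alpha;X;A+u\Sigma_\epsilon)}{\partial u}\Big|_{u=0},$$

$$T_i(X_i) = -\frac{\partial\varphi_i(X_i;A+u\Sigma_\epsilon)}{\partial u}\Big|_{u=0}$$

$$= \frac{1}{\sqrt{n}}\alpha_iR_iX_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}x - \frac{R_i^2}{n}\left[X_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}m_iX_i'M_i^{-1}x + X_i'M_i^{-1}m_iX_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}x\right],$$

$$\tilde q_i(X_i) = X_i'M_i^{-1}\Sigma_\epsilon M_i^{-1}X_i.$$
The new "key equality" is

$$H(\alpha;X) = H_i(\alpha;X) + \frac{T_i(X_i)}{1+\frac{R_i^2}{n}q_i(X_i)} - \frac{R_i^2}{n}\frac{\tilde q_i(X_i)\varphi_i(X_i)}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2}.$$

We are now in a position to do our usual analysis with the Efron-Stein inequality and the Lindeberg
approach.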
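Since the "key equality" for $H$ is an exact identity, it can be checked on a small random instance. The sketch below assumes the sign convention $H = -\partial h/\partial u$ at $u = 0$ and takes $H_i$ to be $H$ computed with $X_i$ replaced by 0; the sizes, the choice of $\Sigma_\epsilon$, and the variable names are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, t = 7, 4, 0.4                          # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
R = rng.uniform(0.5, 2.0, size=n)
A = t * np.eye(p)
G = rng.standard_normal((p, p))
Sigma_eps = G @ G.T / p                      # arbitrary psd Sigma_epsilon
alpha = rng.standard_normal(n)
x = rng.standard_normal(p)
i = 2

def H_statistic(X_mat):
    """H = n^{-1/2} alpha' D X M^{-1} Sigma_eps M^{-1} x with M = X'D^2X/n + A."""
    DX = R[:, None] * X_mat
    M = DX.T @ DX / n + A
    m = DX.T @ alpha / np.sqrt(n)
    return m @ np.linalg.solve(M, Sigma_eps @ np.linalg.solve(M, x))

X_i0 = X.copy()
X_i0[i] = 0.0                                # X with the i-th row replaced by 0
H = H_statistic(X)
H_i = H_statistic(X_i0)

DX_i0 = R[:, None] * X_i0
M_i = DX_i0.T @ DX_i0 / n + A
m_i = DX_i0.T @ alpha / np.sqrt(n)

u_vec = np.linalg.solve(M_i, X[i])                   # M_i^{-1} X_i
w_vec = np.linalg.solve(M_i, Sigma_eps @ u_vec)      # M_i^{-1} Sigma_eps M_i^{-1} X_i
r = R[i] ** 2 / n
s = alpha[i] * R[i] / np.sqrt(n)

q_i = X[i] @ u_vec
q_tilde_i = X[i] @ w_vec
phi_i = s * (u_vec @ x) - r * (u_vec @ m_i) * (u_vec @ x)
T_i = s * (w_vec @ x) - r * ((w_vec @ m_i) * (u_vec @ x)
                             + (u_vec @ m_i) * (w_vec @ x))

H_from_key_equality = (H_i + T_i / (1.0 + r * q_i)
                       - r * q_tilde_i * phi_i / (1.0 + r * q_i) ** 2)
```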

• Efron-Stein aspects Because $H_i(\alpha;X)$ does not involve $X_i$, we clearly have

$$\mathrm{var}(H(\alpha;X)) \leq \sum_{i=1}^n\mathrm{var}\left(H(\alpha;X) - H_i(\alpha;X)\right).$$

Now, clearly,

Vi = var (Hia; X) - Hi{a; Xi)) < K 



E 



l + ^q,{Xi) 



+ ^E 



Xl + ^q,{X,))\ 

X, we have seen that ||?;|| < b{A; Ti^)^/x'A ^x/t.So we conclude that 



Recall now that ||Mr^m,-|| < M < i and llMr^H.M," < 



Vt — ^/t 



'^^^^"^ ■ Therefore, 



E ( [XlMr^J:,M-^miX'iMr^x + X'iMr^rriiX'iMr^^.Mr^x]'^) < KbL{A;X, 



b^{A;^,)x'A~^x 



We finally have 



E 



TiiXi) 



l + ^q,{X,), 



< K 



b'^{A;Y.,)x'A-^: 
t 



1 



-^aibL{2;Xi) + ^bL{A-Xi)- 



n 



For the second part of this simple variance bounding exercise, we first remind the reader that 

<6(^;S,) . 



l + ^q^{X,) 



Hence, we simply need to bound 



E 
something we have essentially already done, and we get easily

2^ 



E 




< K 



n t 



+ 



t2 



We have therefore shown that 



x'A~^x 



n 



We note that when the $R_i$ are i.i.d and uniformly square integrable, the Marcinkiewicz-Zygmund law of large
numbers guarantees that $\sum_{i=1}^nV_i \to 0$, for instance in probability.

We now turn to Lindeberg-type questions.
• Lindeberg aspects As usual, we will go a bit fast here. Essentially the previous decomposition can still
be used, but it should now be understood that the $M_i$ matrix we are dealing with involves both $\{X_m\}_{m<i}$
and $\{Y_k\}_{k>i}$, instead of just $\{X_j\}_{j\neq i}$ or $\{Y_j\}_{j\neq i}$. However, the key fact is that $M_i$ is independent of both
$X_i$ and $Y_i$. Hence, we have for instance

$$E(\varphi_i(X_i)) = E(\varphi_i(Y_i))$$

and

$$E(T_i(X_i)) = E(T_i(Y_i)).$$

Let us call

$$T_1(X_i) = \frac{T_i(X_i)}{1+\frac{R_i^2}{n}q_i(X_i)}$$

and

$$T_2(X_i) = \frac{R_i^2}{n}\frac{\tilde q_i(X_i)\varphi_i(X_i)}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2}.$$

It is clear that if we can control

$$\sum_{i=1}^n\left|E\left(T_1(X_i) - T_1(Y_i)\right) - E\left(T_2(X_i) - T_2(Y_i)\right)\right|$$

we will have control over $|E(H(\alpha;X) - H(\alpha;Y))|$. We recall that we have already showed that

|x||6(^;S,) 



\H{a;X)\ < K'- 



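• Control of $E(T_1(X_i) - T_1(Y_i))$

As usual, we use the fact that

$$T_1(X_i) = \frac{T_i(X_i)}{1+\frac{R_i^2}{n}d_i} + T_1(X_i)\frac{R_i^2}{n}\frac{d_i - q_i(X_i)}{1+\frac{R_i^2}{n}d_i} \triangleq T_{1,1}(X_i) + T_{1,2}(X_i).$$

Naturally, since $X_i$ and $Y_i$ have mean 0 and the same covariance,

$$E\left(\frac{T_i(X_i)}{1+\frac{R_i^2}{n}d_i}\right) = E\left(\frac{T_i(Y_i)}{1+\frac{R_i^2}{n}d_i}\right),$$

so all that is left to do is control $E(|T_{1,2}(X_i)|)$. To do so, we can use Cauchy-Schwarz and recall that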



E 



l + ?lqi{Xi), 



< K 



}?{A;T.,)x'A~^: 



R 



Rf 



^afbLi2;Xi) + ^bL{i;X,)- 



n 
and



Hence, 



E(|ri,2(x,)|) 



< K 



n 



^ajbL{2;Xi) + ^bLi4;X., 
n 



n tV^ 



—=\ai\^/bL{2;Xi) + 

Jn n \ t 



bQ2{2;Xi) . 



In other respects, let us note that 



E(|ri(X,)|)<E(|T,(X,)|) . 



We have 



\ai\bLil;Xi) + 



Rj 26l(2;X,) 



Hence, 
where 



Vra V t 



\ai\VbLi2;Xi) + 

E (Ti(Xi) - ri(yo)l < 'i'^(^i) + ^r{Y^ , 

i2i V6l(4;X0" 



Ri ^bL{^;Xi 



\aiybL{2-Xi) + 



• Control of $E(T_2(X_i) - T_2(Y_i))$

Recall that



T2{Xi) 



R\ ~qi{Xi)^i{Xi) 



1 n bQ,{2;X,) 



n 



R? 



^ (1 + ^qi{X,)f 

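Clearly, using the notation $\Delta_i(X_i) = 1/(1+\frac{R_i^2}{n}q_i(X_i))^2 - 1/(1+\frac{R_i^2}{n}d_i)^2$, we have

$$\frac{\tilde q_i(X_i)}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2} = \frac{\tilde d_i}{\left(1+\frac{R_i^2}{n}d_i\right)^2} + \frac{\tilde q_i(X_i) - \tilde d_i}{\left(1+\frac{R_i^2}{n}q_i(X_i)\right)^2} + \tilde d_i\Delta_i(X_i).$$

Now $E(\varphi_i(X_i)) = E(\varphi_i(Y_i))$, so to control $E(T_2(X_i) - T_2(Y_i))$, all we need to do is control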

i?2 qi{Xi) - di 



+ 



R? 



■diA,{Xi) 



72,1 (X.) 



n 



R? 



(1 + ^qiiX,)f 



-'^iiXi) 



Ri ~, 



Recall that 



Hence, 



T2,2{X,) = ^d-Ai{X,)^i{Xi) . 
n 



n 



n 



R? 



l + i^q,{X,) 



E {\T2,2{Xi)\) < Kb{A; SJ^E {\q,{Xi) - d,\\^i{X,)\) 

n 

and we have already gotten a bound on E {\qi{Xi) — di\\ipi{Xi)\), so we get 

1 



B{\T2,2{X.,)\)<K^b{A;^, 
n 



,'A-H ^/bQ,{2;Xi) 



t 



t 



ai\R^y/bL{2;Xi) + 
n n 



Ri hL{^-Xi 
Similarly, using the fact that $\sqrt{E\left(|\tilde q_i(X_i) - \tilde d_i|^2\right)} \leq \sqrt{b_{Q_2}(2;X_i)}\,b(A;\Sigma_\epsilon)/t$, we see that



n 



B {\T2,i{X,)\) < K 



On the other hand, 



and we have already seen that 



n 



\a^\Ri^/bL{2■,X,) + 



n 



Rf bL{4.;Xi) 



t 



x'A-^x 



t 



\B{T2{Xi))\ <6(A;S,)E(|^i(X,)|) 



E(|,.(x.)|)<W^4(NyM^' ^^^^^ 



t Jn 



So we conclude that 



|E {T2{Xi) - T2{Yi))\ < U,{Xi) + Ui{Yi) 



where 



U,{Xi) = Kb{A-^, 



x'A 



t 



1 



n 



\ai\Riy/bL{2;Xi) + 



n 



Rf hL{A-Xi 



t 



n 



• Putting everything together Since $K$ can be chosen so that $\Psi_i(X_i) = U_i(X_i)$, we conclude that

$$|E(H(\alpha;X) - H(\alpha;Y))| \leq 2\sum_{i=1}^nU_i(X_i) + U_i(Y_i).$$

□



3.5 Checking the heuristics 

In Subsection 2.3, we gave some heuristics to compute an asymptotically deterministic equivalent of
forms like $x'(X'D^2X/n + A)^{-1}x$ and $x'(X'D^2X/n + A)^{-1}\Sigma_\epsilon(X'D^2X/n + A)^{-1}x$ in the case where all the
$X_i$'s have the same covariance. We now prove them rigorously.

Of course, the centerpiece of our analysis is the fact that this only needs to be done in the Gaussian case.
The proof is somewhat involved, since at the level of generality at which we operate, we cannot seem to rely
on invariance properties of the Gaussian distribution which were recently systematically exploited in El
Karoui (2009b), El Karoui (2009c) and have been a mainstay of multivariate statistics (Anderson (2003),
Eaton (2007), Chikuse (2003)). As is often the case, computing the limit (or a deterministic equivalent) of
the quantities we are interested in is in fact at least as difficult as showing that the limit does not depend
on the particulars of the distributions we consider, or bounding the variance (or higher central moments).

It should be noted that our Lindeberg-style results are valid for families where each $X_i$ has a different
$\Sigma_i$. The limits we are investigating here are for the case (mostly encountered or assumed in practice) where
all the $X_i$'s have the same $\Sigma$.

Let us begin by clarifying our assumptions and by introducing some notation: We assume throughout
this subsection that the rows $X_j$ of the matrix $X$ are independent Gaussian random vectors with mean 0
and (identical) covariance $\Sigma$. Due to the concentration properties of the Gaussian distribution (see e.g.
Ledoux (2001)), or using the properties of normal and weighted-$\chi^2$ random variables, this implies that for
any $r \geq 1$,

$$E\left(|v'X_j|^r\right) \leq K_r\|v\|_2^r\,|||\Sigma|||_2^{r/2} \qquad (22)$$

for any deterministic vector $v$ and

$$E\left(|X_j'BX_j - \mathrm{trace}(\Sigma B)|^r\right) \leq K_rp^{r/2}\,|||B|||_2^r\,|||\Sigma|||_2^r \qquad (23)$$
for any deterministic matrix $B$, where $K_r$ is a numerical constant.
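For $r = 2$, the moment bounds above can be made explicit in the Gaussian case. For $X_j \sim \mathcal{N}(0,\Sigma)$ one has $E(v'X_j)^2 = v'\Sigma v$, and, for symmetric $B$, the classical identity $\mathrm{var}(X_j'BX_j) = 2\,\mathrm{trace}((B\Sigma)^2)$. The sketch below checks deterministically that these exact second moments are dominated by the right-hand sides of (22) and (23), with $K_2 = 1$ and $K_2 = 2$ respectively; the restriction to symmetric $B$, the constants, and the variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
p = 8                                        # illustrative dimension (assumption)

G = rng.standard_normal((p, p))
Sigma = G @ G.T / p                          # covariance of the Gaussian rows
C = rng.standard_normal((p, p))
B = (C + C.T) / 2.0                          # symmetric deterministic matrix
v = rng.standard_normal(p)

norm_Sigma = np.linalg.norm(Sigma, 2)        # |||Sigma|||_2
norm_B = np.linalg.norm(B, 2)                # |||B|||_2

# r = 2 version of (22): E (v' X)^2 = v' Sigma v <= ||v||^2 |||Sigma|||_2.
second_moment_linear = v @ Sigma @ v
bound_linear = (v @ v) * norm_Sigma

# r = 2 version of (23): var(X' B X) = 2 trace((B Sigma)^2)
#                                   <= 2 p |||B|||_2^2 |||Sigma|||_2^2.
BS = B @ Sigma
variance_quadratic = 2.0 * np.trace(BS @ BS)
bound_quadratic = 2.0 * p * norm_B ** 2 * norm_Sigma ** 2
```

The factor $p^{r/2}$ in (23) shows up here as the factor $p$ in the bound on $\mathrm{trace}((B\Sigma)^2)$, which is a sum of $p$ squared eigenvalues.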






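Given a matrix $C \succeq 0$, put

$$M_C := A + C,$$

where $A \succeq t\,\mathrm{Id}$ is our regularizing matrix as above. Note that $|||M_C^{-1}|||_2 \le t^{-1}$. In the special case where
$C = S_{(j)} := S - \frac{1}{n}R_j^2X_jX_j'$, we simply write $M_j$ instead of $M_{S_{(j)}}$. We now recall the classic rank-1 update
formula which will again be used repeatedly in this part of the paper:

$$M_S^{-1} = M_j^{-1} - \frac{\frac{1}{n}R_j^2\, M_j^{-1}X_jX_j'M_j^{-1}}{1 + \frac{1}{n}R_j^2\, X_j'M_j^{-1}X_j}. \qquad (24)$$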
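As a sanity check, the rank-1 update (24) can be verified numerically. The following is a minimal sketch, assuming NumPy, $\Sigma = \mathrm{Id}$, and arbitrary illustrative values for `n`, `p`, `t` and the scale factors `R` (none of these choices come from the paper). It also checks the consequence $M_S^{-1}X_j = (1 + \frac{1}{n}R_j^2 q_j)^{-1}M_j^{-1}X_j$ with $q_j = X_j'M_j^{-1}X_j$, which is used in the proof of Proposition 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t = 50, 20, 0.5                      # illustrative sizes (assumption)

X = rng.standard_normal((n, p))            # rows X_j, here with Sigma = Id
R = rng.uniform(0.5, 2.0, size=n)          # scale factors R_j (illustrative)
A = t * np.eye(p)                          # regularizer with A >= t * Id

S = (X * (R**2)[:, None]).T @ X / n        # S = (1/n) sum_j R_j^2 X_j X_j'
j = 3
u = R[j] * X[j] / np.sqrt(n)               # S = S_(j) + u u'

M_S_inv = np.linalg.inv(A + S)
M_j_inv = np.linalg.inv(A + S - np.outer(u, u))

# Right-hand side of (24); M_j_inv is symmetric, so outer(w, w) = M_j^{-1} u u' M_j^{-1}
w = M_j_inv @ u
rhs = M_j_inv - np.outer(w, w) / (1.0 + u @ w)
max_err = np.max(np.abs(M_S_inv - rhs))

# Consequence of (24): M_S^{-1} X_j = M_j^{-1} X_j / (1 + R_j^2 q_j / n)
q_j = X[j] @ M_j_inv @ X[j]
vec_err = np.max(np.abs(M_S_inv @ X[j] - M_j_inv @ X[j] / (1.0 + R[j]**2 * q_j / n)))

print(max_err, vec_err)
```

Both printed errors should be at the level of floating-point round-off.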



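Unless otherwise mentioned, $B$ is always a deterministic positive semidefinite matrix in the sequel.
For $j = 1, \ldots, n$, let

$$q_j := X_j'M_j^{-1}X_j, \quad d_j := \mathrm{trace}(\Sigma M_j^{-1}), \quad \tilde q_j := X_j'M_j^{-1}BM_j^{-1}X_j, \quad \tilde d_j := \mathrm{trace}(\Sigma M_j^{-1}BM_j^{-1}).$$

In this subsection, we will usually replace $q_j$ and $\tilde q_j$ with the fully deterministic quantities $E(d_j)$ and $E(\tilde d_j)$
(instead of $d_j$ and $\tilde d_j$). Using the fact that $B$, $\Sigma$ and $M_j^{-1}$ are positive semidefinite, it is easy to see that

$$0 \le \frac{1}{1 + \frac{1}{n}R_j^2 q_j} \le 1, \qquad 0 \le \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)} \le 1 \qquad (25)$$

and

$$0 \le \frac{\frac{1}{n}R_j^2 \tilde q_j}{1 + \frac{1}{n}R_j^2 q_j} \le \frac{|||B|||_2}{t}, \qquad 0 \le \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{1 + \frac{1}{n}R_j^2 E(d_j)} \le \frac{|||B|||_2}{t}. \qquad (26)$$

The following lemma provides some additional estimates which will be used in this subsection:
Lemma 3.3. Suppose that the above-mentioned assumptions are satisfied.
(a) We have

$$\left|\mathrm{trace}\left(\Sigma M_S^{-1} - \Sigma M_j^{-1}\right)\right| \le |||\Sigma|||_2\, t^{-1}$$

and

$$\left|\mathrm{trace}\left(\Sigma M_S^{-1}BM_S^{-1} - \Sigma M_j^{-1}BM_j^{-1}\right)\right| \le 2\,|||B|||_2\,|||\Sigma|||_2\, t^{-2}.$$

(b) For fixed $r \ge 1$, we have

$$E\left(\left|\mathrm{trace}(\Sigma M_S^{-1}) - E\left(\mathrm{trace}(\Sigma M_S^{-1})\right)\right|^r\right) \le K_r'\, n^{r/2}\, |||\Sigma|||_2^r\, t^{-r}$$

and

$$E\left(\left|\mathrm{trace}(\Sigma M_S^{-1}BM_S^{-1}) - E\left(\mathrm{trace}(\Sigma M_S^{-1}BM_S^{-1})\right)\right|^r\right) \le K_r''\, n^{r/2}\, |||B|||_2^r\, |||\Sigma|||_2^r\, t^{-2r},$$

where $K_r'$ and $K_r''$ are constants depending only on $r$.
(c) We have

$$E\left|\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right| \le K'\left(1 \wedge \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||\Sigma|||_2\, t^{-1}\right)$$

and

$$E\left|\frac{\frac{1}{n}R_j^2 \tilde q_j}{\left(1 + \frac{1}{n}R_j^2 q_j\right)^2} - \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{\left(1 + \frac{1}{n}R_j^2 E(d_j)\right)^2}\right| \le K''\,|||B|||_2\, t^{-1}\left(1 \wedge \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||\Sigma|||_2\, t^{-1}\right),$$

where $K'$ and $K''$ are numerical constants.
(d) For any square-integrable random variables $Z_j$ such that $E(Z_j^2) \le L^2$, we have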



z, 



and 



where 



z, 



0{LU) 



0{LU\\\B\U-^) , 



U■.= Y.[lR]^^R]{^+V^)m\\2t-') ■ 
i=i 

(e) For any bounded random vectors $V_j$ and $W_j$ independent of $X_j$ such that $\|V_j\|_2 \le L_1$ and $\|W_j\|_2 \le L_2$,
we have

n 



and 



where 



V^Mg^BM-^Wj - V-M-^BM-^Wj 



0{LiL2U\\\B\\\2t-^) 



Proof. Throughout this proof, K denotes a numerical constant which may change from step to step. 

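(a) At least the first inequality is well known in random matrix theory (see e.g. Silverstein and Bai
(1995)). We include a proof for the sake of completeness. Using (24), we get

$$\left|\mathrm{trace}\left(\Sigma M_S^{-1} - \Sigma M_j^{-1}\right)\right| = \frac{\frac{1}{n}R_j^2\, X_j'M_j^{-1}\Sigma M_j^{-1}X_j}{1 + \frac{1}{n}R_j^2\, X_j'M_j^{-1}X_j} \le |||M_j^{-1/2}\Sigma M_j^{-1/2}|||_2 \le |||\Sigma|||_2\, t^{-1}.$$

In fact, this continues to hold for a general square matrix $\Sigma$. It therefore follows that

$$\left|\mathrm{trace}\left(\Sigma M_S^{-1}BM_S^{-1} - \Sigma M_j^{-1}BM_j^{-1}\right)\right| \le \left|\mathrm{trace}\left(\Sigma(M_S^{-1} - M_j^{-1})BM_S^{-1}\right)\right| + \left|\mathrm{trace}\left(\Sigma M_j^{-1}B(M_S^{-1} - M_j^{-1})\right)\right| \le 2\,|||B|||_2\,|||\Sigma|||_2\, t^{-2}.$$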
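The two deterministic bounds of part (a) are easy to probe numerically. The sketch below assumes NumPy and uses arbitrary illustrative choices for `Sigma`, `B`, the scale factors `R` and the dimensions (all hypothetical, not taken from the paper). It compares the trace gaps with the bounds $|||\Sigma|||_2 t^{-1}$ and $2|||B|||_2|||\Sigma|||_2 t^{-2}$ for every $j$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, t = 40, 30, 0.3                       # illustrative sizes (assumption)

# Illustrative covariance Sigma and PSD matrix B (hypothetical choices)
G = rng.standard_normal((p, p))
Sigma = G @ G.T / p
H = rng.standard_normal((p, p))
B = H @ H.T / p

L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T       # rows X_j ~ N(0, Sigma)
R = rng.uniform(0.5, 3.0, size=n)
A = t * np.eye(p)
S = (X * (R**2)[:, None]).T @ X / n


def op_norm(M):
    """Operator (spectral) norm |||M|||_2."""
    return np.linalg.norm(M, 2)


M_S_inv = np.linalg.inv(A + S)
gap1, gap2 = [], []
for j in range(n):
    u = R[j] * X[j] / np.sqrt(n)
    M_j_inv = np.linalg.inv(A + S - np.outer(u, u))   # leave-one-out inverse
    gap1.append(abs(np.trace(Sigma @ (M_S_inv - M_j_inv))))
    gap2.append(abs(np.trace(Sigma @ (M_S_inv @ B @ M_S_inv - M_j_inv @ B @ M_j_inv))))

bound1 = op_norm(Sigma) / t
bound2 = 2 * op_norm(B) * op_norm(Sigma) / t**2
print(max(gap1), bound1, max(gap2), bound2)
```

In such experiments the gaps are typically far below the bounds, which are worst-case statements.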



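(b) This is a simple consequence of Azuma's inequality (see e.g. Lemma 4.1 in Ledoux (2001)). We follow the
proof of Lemma 6 in El Karoui (2009a): For $j = 0, \ldots, n$, let $\mathcal{F}_j$ denote the $\sigma$-field generated by $X_1, \ldots, X_j$.
Then, using part (a), we have

$$\left|E\left(\mathrm{trace}(\Sigma M_S^{-1}) \mid \mathcal{F}_j\right) - E\left(\mathrm{trace}(\Sigma M_S^{-1}) \mid \mathcal{F}_{j-1}\right)\right|
= \left|E\left(\mathrm{trace}(\Sigma(M_S^{-1} - M_j^{-1})) \mid \mathcal{F}_j\right) - E\left(\mathrm{trace}(\Sigma(M_S^{-1} - M_j^{-1})) \mid \mathcal{F}_{j-1}\right)\right| \le 2\,|||\Sigma|||_2\, t^{-1},$$

so, by Azuma's inequality, we get

$$\Pr\left(\left|\mathrm{trace}(\Sigma M_S^{-1}) - E\left(\mathrm{trace}(\Sigma M_S^{-1})\right)\right| > u\right) \le 2\exp\left(-u^2/(8n\,|||\Sigma|||_2^2\, t^{-2})\right)$$

for all $u > 0$. Since $E(|Z|^r) = \int_0^\infty r u^{r-1} \Pr(|Z| > u)\, du$ for any real random variable $Z$, the first inequality
follows easily. The second inequality is derived similarly.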
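For completeness, the integration step behind "follows easily" can be written out. This is a worked sketch rather than part of the original argument; the shorthand $a := 8n\,|||\Sigma|||_2^2\, t^{-2}$ and the explicit constant are introduced here only for illustration.

```latex
% Z := trace(\Sigma M_S^{-1}) - E(trace(\Sigma M_S^{-1})),  a := 8 n |||\Sigma|||_2^2 t^{-2}
\begin{aligned}
E(|Z|^r) &= \int_0^\infty r u^{r-1} \Pr(|Z| > u)\, du
          \le 2r \int_0^\infty u^{r-1} e^{-u^2/a}\, du \\
         &= 2r \cdot \frac{a^{r/2}}{2} \int_0^\infty s^{r/2 - 1} e^{-s}\, ds
          \qquad (s = u^2/a) \\
         &= r\,\Gamma(r/2)\, a^{r/2}
          = r\,\Gamma(r/2)\, 8^{r/2}\; n^{r/2}\, |||\Sigma|||_2^r\, t^{-r}.
\end{aligned}
```

So one admissible choice of constant in the first inequality of (b) is $r\,\Gamma(r/2)\,8^{r/2}$, which indeed depends only on $r$.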
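(c) Recall that from (25), we have the simple estimate

$$\left|\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right| \le 1.$$

Using (23) and part (b), we also have the estimate

$$E\left|\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right| \le \tfrac{1}{n}R_j^2\, E(|q_j - E(d_j)|)
\le \tfrac{1}{n}R_j^2\, E(|q_j - d_j|) + \tfrac{1}{n}R_j^2\, E(|d_j - E(d_j)|) \le K \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||\Sigma|||_2\, t^{-1}.$$

It follows that

$$E\left|\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right| \le K\left(1 \wedge \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||\Sigma|||_2\, t^{-1}\right), \qquad (27)$$

and the first inequality is proved. For the second inequality, first observe that from (25) and (26), we have
the simple estimate

$$\left|\frac{\frac{1}{n}R_j^2 \tilde q_j}{\left(1 + \frac{1}{n}R_j^2 q_j\right)^2} - \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{\left(1 + \frac{1}{n}R_j^2 E(d_j)\right)^2}\right| \le |||B|||_2\, t^{-1}.$$

Moreover, writing

$$\frac{\frac{1}{n}R_j^2 \tilde q_j}{\left(1 + \frac{1}{n}R_j^2 q_j\right)^2} - \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{\left(1 + \frac{1}{n}R_j^2 E(d_j)\right)^2}
= \frac{\frac{1}{n}R_j^2\left(\tilde q_j - E(\tilde d_j)\right)}{\left(1 + \frac{1}{n}R_j^2 q_j\right)\left(1 + \frac{1}{n}R_j^2 E(d_j)\right)}$$
$$+ \frac{\frac{1}{n}R_j^2 \tilde q_j}{1 + \frac{1}{n}R_j^2 q_j}\left(\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right)
+ \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{1 + \frac{1}{n}R_j^2 E(d_j)}\left(\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right)$$

and using (23) and part (b), (25), (26) as well as (27), we get the estimate

$$E\left|\frac{\frac{1}{n}R_j^2 \tilde q_j}{\left(1 + \frac{1}{n}R_j^2 q_j\right)^2} - \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{\left(1 + \frac{1}{n}R_j^2 E(d_j)\right)^2}\right|
\le \tfrac{1}{n}R_j^2\, E\left(|\tilde q_j - E(\tilde d_j)|\right) + 2\,|||B|||_2\, t^{-1}\, E\left|\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right|$$
$$\le K \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||B|||_2\,|||\Sigma|||_2\, t^{-2}.$$

It follows that

$$E\left|\frac{\frac{1}{n}R_j^2 \tilde q_j}{\left(1 + \frac{1}{n}R_j^2 q_j\right)^2} - \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{\left(1 + \frac{1}{n}R_j^2 E(d_j)\right)^2}\right| \le K\,|||B|||_2\, t^{-1}\left(1 \wedge \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||\Sigma|||_2\, t^{-1}\right). \qquad (28)$$

(d) Similar arguments as in part (c) show that

$$\left(E\left|\frac{1}{1 + \frac{1}{n}R_j^2 q_j} - \frac{1}{1 + \frac{1}{n}R_j^2 E(d_j)}\right|^2\right)^{1/2} \le K\left(1 \wedge \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||\Sigma|||_2\, t^{-1}\right)$$

and

$$\left(E\left|\frac{\frac{1}{n}R_j^2 \tilde q_j}{\left(1 + \frac{1}{n}R_j^2 q_j\right)^2} - \frac{\frac{1}{n}R_j^2 E(\tilde d_j)}{\left(1 + \frac{1}{n}R_j^2 E(d_j)\right)^2}\right|^2\right)^{1/2} \le K\,|||B|||_2\, t^{-1}\left(1 \wedge \tfrac{1}{n}R_j^2(\sqrt{p} + \sqrt{n})\,|||\Sigma|||_2\, t^{-1}\right).$$

Thus, the claim follows from the Cauchy-Schwarz inequality.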
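(e) On the one hand, we have the simple estimate

$$\left|V_j'(M_S^{-1} - M_j^{-1})W_j\right| \le 2L_1L_2\, t^{-1}.$$

On the other hand, using (24), the Cauchy-Schwarz inequality and (22), we have the estimate

$$E\left(\left|V_j'(M_S^{-1} - M_j^{-1})W_j\right|\right) \le \tfrac{1}{n}R_j^2\, E\left(\left|V_j'M_j^{-1}X_jX_j'M_j^{-1}W_j\right|\right) \le K \tfrac{1}{n}R_j^2\, L_1L_2\,|||\Sigma|||_2\, t^{-2}.$$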

Combining these estimates, it follows that 



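which establishes the first part of (e). For the second part of (e), write

$$E\left|V_j'M_S^{-1}BM_S^{-1}W_j - V_j'M_j^{-1}BM_j^{-1}W_j\right| \le E\left|V_j'(M_S^{-1} - M_j^{-1})BM_j^{-1}W_j\right| + E\left|V_j'M_j^{-1}B(M_S^{-1} - M_j^{-1})W_j\right|$$
$$+ E\left|V_j'(M_S^{-1} - M_j^{-1})B(M_S^{-1} - M_j^{-1})W_j\right|.$$

By the preceding estimates, the first two expectations are bounded by $K\frac{1}{n}R_j^2L_1L_2\,|||B|||_2\,|||\Sigma|||_2\, t^{-3}$.
For the third expectation, we can use (24) and (26) to get

$$\left|V_j'(M_S^{-1} - M_j^{-1})B(M_S^{-1} - M_j^{-1})W_j\right| = \frac{\frac{1}{n^2}R_j^4}{\left(1 + \frac{1}{n}R_j^2 q_j\right)^2}\left|V_j'M_j^{-1}X_jX_j'M_j^{-1}BM_j^{-1}X_jX_j'M_j^{-1}W_j\right|
\le \tfrac{1}{n}R_j^2\left|V_j'M_j^{-1}X_jX_j'M_j^{-1}W_j\right|\,|||B|||_2\, t^{-1}$$

and therefore, by the Cauchy-Schwarz inequality and (22),

$$E\left(\left|V_j'(M_S^{-1} - M_j^{-1})B(M_S^{-1} - M_j^{-1})W_j\right|\right) \le K \tfrac{1}{n}R_j^2\, L_1L_2\,|||B|||_2\,|||\Sigma|||_2\, t^{-3}.$$

Combining this with the simple estimate

$$\left|V_j'M_S^{-1}BM_S^{-1}W_j - V_j'M_j^{-1}BM_j^{-1}W_j\right| \le 2L_1L_2\,|||B|||_2\, t^{-2},$$

it follows that

$$\sum_{j=1}^n \tfrac{1}{n}R_j^2\, E\left(\left|V_j'M_S^{-1}BM_S^{-1}W_j - V_j'M_j^{-1}BM_j^{-1}W_j\right|\right) = O\left(L_1L_2\,U\,|||B|||_2\, t^{-2}\right).$$

This completes the proof of the lemma. □

To verify Heuristic 2.1, we will prove the following result: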



Proposition 3.1. Suppose that the assumptions from the beginning of this subsection hold, the ratio p/n 
stays bounded, \\v\\ = 1 and 



n jj2 

E 



'J NISI 



n 



0(1) and J^^A^IIISI 



n n 



3/2 I 



ofl) 



(29) 



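as $n \to \infty$. Then we have

$$E\left(v'(S + A)^{-1}v\right) - E\left(v'(\gamma(A)\Sigma + A)^{-1}v\right) \to 0,$$

where

$$\gamma(A) := \frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{1 + \frac{1}{n}R_i^2\, \mathrm{trace}(\Sigma M_S^{-1})}.$$

Proof. We first show that we may replace $\gamma(A)$ with the deterministic quantity

$$\tilde\gamma(A) := \frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{1 + \frac{1}{n}R_i^2\, E\left(\mathrm{trace}(\Sigma M_i^{-1})\right)}.$$

To this end, since $|v'(A + \gamma(A)\Sigma)^{-1}v - v'(A + \tilde\gamma(A)\Sigma)^{-1}v| \le t^{-2}\,|\gamma(A) - \tilde\gamma(A)|\,|||\Sigma|||_2$, it suffices to show
that

$$E\left(|\gamma(A) - \tilde\gamma(A)|\right)\,|||\Sigma|||_2 = o(1). \qquad (30)$$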



But now, 

n 

E(|7(A) -7(^)1) <^E 
1=1 



1 d2 



1 p2 



1 + iij2trace i^M,') 1 + iiJfE (trace {^M,')) 



i=l 

n 



1 r2 



1 p2 



1 + ii?2E (trace (SM^"i)) 1 + ii^^E (trace (SMri)) 



where the second step follows from arguments similar to those in the proof of Lemma 3.3 (c) (using Lemma 3.3
(b) and (a)). Thus, (30) follows from Assumption (29).

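We now proceed similarly as in Silverstein (1995) and El Karoui (2009a). Using (24), it is easy to check
that $M_S^{-1}X_j = (1 + \frac{1}{n}R_j^2 q_j)^{-1}M_j^{-1}X_j$. Thus, setting $T := \tilde\gamma(A)\Sigma$, so that $M_T = A + \tilde\gamma(A)\Sigma$, we get

$$M_S^{-1} - M_T^{-1} = -M_S^{-1}(S - T)M_T^{-1} = -\sum_{i=1}^n \tfrac{1}{n}R_i^2\, M_S^{-1}X_iX_i'M_T^{-1} + M_S^{-1}TM_T^{-1}$$
$$= -\sum_{i=1}^n \left(\frac{\frac{1}{n}R_i^2\, M_i^{-1}X_iX_i'M_T^{-1}}{1 + \frac{1}{n}R_i^2 q_i} - \frac{\frac{1}{n}R_i^2\, M_S^{-1}\Sigma M_T^{-1}}{1 + \frac{1}{n}R_i^2 E(d_i)}\right)$$

and therefore

$$E\left(v'M_S^{-1}v\right) - E\left(v'M_T^{-1}v\right) = -\sum_{i=1}^n E\left(\frac{\frac{1}{n}R_i^2\, v'M_i^{-1}X_iX_i'M_T^{-1}v}{1 + \frac{1}{n}R_i^2 q_i} - \frac{\frac{1}{n}R_i^2\, v'M_S^{-1}\Sigma M_T^{-1}v}{1 + \frac{1}{n}R_i^2 E(d_i)}\right). \qquad (31)$$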



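Now, using Lemma 3.3 (d), the independence of $X_i$ and $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$, Lemma 3.3 (e), and
Assumption (29), it follows that

$$\sum_{i=1}^n E\left(\frac{\frac{1}{n}R_i^2\, v'M_i^{-1}X_iX_i'M_T^{-1}v}{1 + \frac{1}{n}R_i^2 q_i}\right)
= \sum_{i=1}^n E\left(\frac{\frac{1}{n}R_i^2\, v'M_i^{-1}X_iX_i'M_T^{-1}v}{1 + \frac{1}{n}R_i^2 E(d_i)}\right) + o(1)$$
$$= \sum_{i=1}^n E\left(\frac{\frac{1}{n}R_i^2\, v'M_i^{-1}\Sigma M_T^{-1}v}{1 + \frac{1}{n}R_i^2 E(d_i)}\right) + o(1)
= \sum_{i=1}^n E\left(\frac{\frac{1}{n}R_i^2\, v'M_S^{-1}\Sigma M_T^{-1}v}{1 + \frac{1}{n}R_i^2 E(d_i)}\right) + o(1).$$

This completes the proof of Proposition 3.1. □

To verify Heuristic 2.2, we will prove the following result: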



Proposition 3.2. Suppose that the assumptions from the beginning of this subsection hold, the ratio p/n 
stays bounded, \\v\\ = 1 and 



^ n 



0(1) 



and 



y ( l^Ai|||s| 



\B\ 



o(l) 



(32) 



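as $n \to \infty$. Then we have

$$E\left(v'(S + A)^{-1}B(S + A)^{-1}v\right) - E\left(v'(A + \gamma(A)\Sigma)^{-1}(B + \xi(A, B)\Sigma)(A + \gamma(A)\Sigma)^{-1}v\right) \to 0,$$

where $\gamma(A)$ is defined in Proposition 3.1 and

$$\xi(A, B) := \frac{1}{n}\sum_{i=1}^n \frac{R_i^4}{\left(1 + \frac{1}{n}R_i^2\, \mathrm{trace}(\Sigma M_S^{-1})\right)^2} \cdot \frac{1}{n}\, \mathrm{trace}\left(\Sigma(S + A)^{-1}B(S + A)^{-1}\right).$$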



Proof. Similarly as in the proof of Proposition 3.1, we first show that we may replace $\gamma(A)$ and $\xi(A, B)$
with the deterministic quantities $\tilde\gamma(A)$ and $\tilde\xi(A, B)$, where $\tilde\gamma(A)$ is defined in the proof of Proposition 3.1
and

$$\tilde\xi(A, B) := \frac{1}{n}\sum_{i=1}^n \frac{R_i^4 \cdot \frac{1}{n}\, E\left(\mathrm{trace}(\Sigma M_i^{-1}BM_i^{-1})\right)}{\left(1 + \frac{1}{n}R_i^2\, E\left(\mathrm{trace}(\Sigma M_i^{-1})\right)\right)^2}.$$

To begin with, similarly as in (26), $\xi(A, B)$ and $\tilde\xi(A, B)$ are bounded by $\sum_{j=1}^n \frac{1}{n}R_j^2\,|||B|||_2\, t^{-1}$. It therefore
follows from Assumption (32) that $\xi(A, B)\Sigma$ and $\tilde\xi(A, B)\Sigma$ are bounded (in operator norm) by $K\,|||B|||_2$,
where $K$ is a constant. Thus, using the decomposition

$$A_1A_2A_3 - B_1B_2B_3 = (A_1 - B_1)B_2B_3 + A_1(A_2 - B_2)B_3 + A_1A_2(A_3 - B_3),$$

we see that it suffices to check that

$$E\left(|\gamma(A) - \tilde\gamma(A)|\right)\,|||B|||_2\,|||\Sigma|||_2 = o(1) \quad\text{and}\quad E\left(|\xi(A, B) - \tilde\xi(A, B)|\right)\,|||\Sigma|||_2 = o(1). \qquad (33)$$

The former bound is clear from (30), since we are assuming (32) instead of (29) now. For the latter bound,
let us note that

B{\aA,B)-aA,B)\) 

i^Rf trace {T^Mg^BM^^) ;^i?fE (trace (SM^^SM^^)) 



n 



i=l 



(1 + trace (SM^"^))' (l + ^iJ^E (trace {^M^'))f 



j=i 



Rf E (trace {T^Mg^BM^^)) -^Rf E (trace {llMr^BMr^)) 



1 d4 



:i + ii??E (trace {^Mf))f (l + ^i??E (trace (SM"!)))' 



where the second step follows from similar arguments as in the proof of Lemma 3.3 (c) (using Lemma 3.3 
(b) and (a)). In view of Assumption (32), this establishes (33). 
Put $T := \tilde\gamma(A)\Sigma$, $T(u) := \tilde\gamma(A + uB)\Sigma$ ($u \ge 0$) and observe that

$$\frac{d}{du}(S + A + uB)^{-1}\Big|_{u=0} = -(S + A)^{-1}B(S + A)^{-1}$$

and

$$\frac{d}{du}(T(u) + A + uB)^{-1}\Big|_{u=0} = -(T + A)^{-1}\left(B + \tilde\xi(A, B)\Sigma\right)(T + A)^{-1}.$$

Thus, replacing $A$ with $A + uB$ in (31) and calculating the derivative with respect to $u$ at $u = 0$, we get

$$-E\left(v'M_S^{-1}BM_S^{-1}v\right) + E\left(v'M_T^{-1}(B + \tilde\xi(A, B)\Sigma)M_T^{-1}v\right) = D_1 + D_2 + D_3, \qquad (34)$$

where 

= + Ve " \ v'Mr^BMr^XiX'M~^v v'M~'^BM^^^M-^v , 



* . ' ^ . l + ii2i^E(d.) 
= + E ( t;'M-iX,X;M-i (i? + e(A, S)S)M-i7; 

" ^ 7;'M5^SMy^(S + ^(A,S)E)My^r; 



1=1 \ ' n I 
i=l \ - ' n"i 



l + ii?2E(d,) 

Similarly as in the proof of Proposition 3.1, the idea is to show that each of the three terms on the right-hand
side of (34) tends to $0$ as $n \to \infty$. Making appropriate use of Lemma 3.3, this follows by essentially the same
calculation as in the proof of Proposition 3.1, so we only sketch the details.

For the first difference, we use Lemma 3.3 (d), the independence of $X_i$ and $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$,
and Lemma 3.3 (e) to obtain

/ 1 734 



1=1 



(In the final step, we have also used (25) and (26) to see that the fraction is bounded by iii?|||i?|||2t 
For the second and third difference, it follows by similar arguments that 



v'M7'BMr^X,X[M-\ 

i=l n^ili 
1 p2 

l + lRf^di) 

1 p2 

l + ii?fE(d,) 



" / 1^2 \ 

1 ^ iVp-M.^ vM-^BM-^X,X[M^\ +o(1) 



i=l 

: Ve v'Mr^BMr^T.M-^'v \ + oil) 

1=1 






and 



" v'M7^X,X[M;,\B + i[A,B)T.)M~\ 

i=l V+n^i'ii J 

= I ^ ^ 1 VE(d,) ^'^"'^^-^^'^T + B)J:)M^'v\ +o(1) 

" / 1^2 _ \ 

= 5^^ 1 1 + 4^ E(ci.) '''^^''^^tHB + g)S)M-^^ j +o(l) 

" / 1^2 _ \ 

= 5Z ^ l l + lVE(cj.) '''^s'^^T'iB + C{A, B)^)M-'vj + o(l) . 
This concludes the proof. □ 



3.6 On the Rate of Convergence 

Suppose that the constants $b_L(4; X_i)$ and $b_Q(2; X_i)/n$ from (3) and (4) are uniformly bounded. Then
our results show that if the $R_i$ are also uniformly bounded, we have, for instance,

$$E\left(|g(a; X) - E(g(a; X))|^2\right) = O(n^{-1})$$

and

$$|E\left(g(a; X) - g(a; Y)\right)| = O(n^{-1/2}).$$

More generally, this still holds if the Ri are given by i.i.d. random variables with finite 4th moments. 
For some applications (e.g. to the field of finance), it may be helpful to have the same results under the 
weaker assumption that the Ri are given by i.i.d. random variables with finite second moments only. Recall 
that this is the minimal reasonable assumption, for if the Ri do not have second moments, the covariance 
matrix of the vectors RiXi is not defined. In this section we sketch how to derive results under this minimal 
assumption. 

However, to derive our results, we need somewhat stronger conditions on the covariance matrices $\Sigma_j$,
the regularizing matrix $A$ and the distributions of the random variables $X_j$. More precisely, we will work
under the following additional assumptions:

• We have $p/n \ge c_g$ for some $c_g > 0$.

• We have $\frac{1}{n}\sum_{i=1}^n R_i^2 \le C_R$ for some $C_R < \infty$.

• We have $\frac{1}{p}\mathrm{trace}(A) \le C_A$ for some $C_A < \infty$.

• We have $\frac{1}{p}\mathrm{trace}(\Sigma_j) \le C_\Sigma$ for some $C_\Sigma < \infty$.

• There exist $c_\Sigma > 0$ and $\varepsilon \in (0, 1)$ such that the number of eigenvalues of $\Sigma_j$ which are less than $c_\Sigma$
is less than $p\varepsilon$.

• We have $|||\Sigma_j|||_2 \le C_\Sigma$, and the constants $b_L(8; X_i)$ and $b_Q(4; X_i)/n^2$ are uniformly bounded.

Let us mention that the very last assumption could be weakened in exchange for a worse rate of convergence
in the following results. For instance, we could easily allow for a bound of the order $O(\log n)$ or $O(n^\delta)$
(with $\delta > 0$ sufficiently small).

As it is our main intention here to give an idea of what is possible, we concentrate on one particular case
and present results for the quadratic form $g(a) := n^{-1}a'DX(X'D^2X/n + A)^{-1}X'Da$ (from Section 3.3.3)
only.

Our results rely on the observation that (i) normalized traces of regularized inverses of random matrices
are typically strongly concentrated and (ii) under the assumptions stated above, $E\left(\mathrm{trace}(\Sigma_jM_j^{-1})\right)$ is of
the order $n$. Let us provide precise formulations:
The first observation has already been used several times in this paper (see also El Karoui (2009a)),
for instance in the proof of Lemma 3.3 (b), where it is stated that

$$P\left(\left|\tfrac{1}{n}\mathrm{trace}(\Sigma M_S^{-1}) - E\left(\tfrac{1}{n}\mathrm{trace}(\Sigma M_S^{-1})\right)\right| > u\right) \le 2\exp\left(-u^2n^2t^2/(8n\,|||\Sigma|||_2^2)\right)$$

for any $u > 0$. For the second observation, we show the following lemma.
Lemma 3.4. Under the afore-mentioned assumptions, we have

$$E\left(\tfrac{1}{n}\mathrm{trace}(\Sigma_iM_i^{-1})\right) \ge c,$$

where $c = c(c_g, C_R, C_A, C_\Sigma, c_\Sigma, \varepsilon)$.

In the following proof, if $A$, $B$ are any matrices and $x \ge 0$, we call $B$ (by slight abuse of terminology)
a rank $x$ modification of $A$ if $\mathrm{rank}(A - B) \le x$, and if $M$ is any symmetric matrix, we let $\lambda_1(M), \ldots, \lambda_p(M)$
be the eigenvalues of $M$.

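Proof. We have

$$E\left(\mathrm{trace}(M_i)\right) = E\left(\mathrm{trace}\left(A + \tfrac{1}{n}\sum_{j \ne i} R_j^2X_jX_j'\right)\right) = \mathrm{trace}(A) + \tfrac{1}{n}\sum_{j \ne i} R_j^2\, \mathrm{trace}(\Sigma_j) \le p(C_A + C_RC_\Sigma).$$

Thus, by Markov's inequality, it follows that with $C := 4(C_A + C_RC_\Sigma)/(1 - \varepsilon)$, we have

$$P\left(\tfrac{1}{p}\sum_{j=1}^p 1_{\{\lambda_j(M_i) > C\}} > \tfrac{1 - \varepsilon}{2}\right) \le P\left(\tfrac{1}{p}\mathrm{trace}(M_i) > 2(C_A + C_RC_\Sigma)\right) \le \tfrac{1}{2}.$$

Consider the set $G$ where $\frac{1}{p}\sum_{j=1}^p 1_{\{\lambda_j(M_i) > C\}} \le \frac{1 - \varepsilon}{2}$, so $P(G) \ge \frac{1}{2}$. Then, by spectral calculus, there exists
a positive-definite rank $p\left(\frac{1 - \varepsilon}{2}\right)$ modification $\tilde M_i$ of $M_i$ such that $\lambda_j(\tilde M_i) \le C$ for all $j = 1, \ldots, p$. Similarly,
there exists a positive-definite rank $p\varepsilon$ modification $\tilde\Sigma_i$ of $\Sigma_i$ such that $\lambda_j(\tilde\Sigma_i) \ge c_\Sigma$ for all $j = 1, \ldots, p$.
It follows that $\lambda_j(\tilde\Sigma_i^{1/2}\tilde M_i^{-1}\tilde\Sigma_i^{1/2}) \ge c_\Sigma/C$ for all $j = 1, \ldots, p$. Indeed, for any vector $x$ of norm 1,

$$x'\tilde\Sigma_i^{1/2}\tilde M_i^{-1}\tilde\Sigma_i^{1/2}x \ge x'\tilde\Sigma_ix/C \ge c_\Sigma/C,$$

and thus $\tilde\Sigma_i^{1/2}\tilde M_i^{-1}\tilde\Sigma_i^{1/2} \succeq (c_\Sigma/C)\,\mathrm{Id}$. By Theorem A.43 in Bai and Silverstein (2010), it further follows that
at least $p\left(\frac{1 - \varepsilon}{2}\right)$ eigenvalues of $\Sigma_i^{1/2}M_i^{-1}\Sigma_i^{1/2}$ are $\ge c_\Sigma/C$. Thus, as $\Sigma_i^{1/2}M_i^{-1}\Sigma_i^{1/2} \succeq 0$, we have shown that on
the set $G$,

$$\tfrac{1}{n}\mathrm{trace}(\Sigma_iM_i^{-1}) \ge \tfrac{p}{n} \cdot \tfrac{1 - \varepsilon}{2}\, c_\Sigma/C.$$

Since $P(G) \ge \frac{1}{2}$ and $\Sigma_i^{1/2}M_i^{-1}\Sigma_i^{1/2} \succeq 0$, we may conclude that

$$\tfrac{1}{n}E\left(\mathrm{trace}(\Sigma_iM_i^{-1})\right) \ge \tfrac{1}{2} \cdot \tfrac{1 - \varepsilon}{2}\, c_g c_\Sigma/C =: c.$$

□

Corollary 3.1. We have $P\left(\frac{1}{n}\mathrm{trace}(\Sigma_iM_i^{-1}) \le \frac{1}{2}c\right) \le C_0\exp(-c_0n)$.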

Let us now investigate the implications of these observations for our results concerning g{a, X): Recall 
from the proof of Theorem 3.4 that (with the notation there) 

n 

E (|5(a; X) - E {g{a; X)) P) < E (|r - T^j^) , 



where 



qi{Xi)-di ^ 1 

" {l + ^q,){l + ^d,) l + ^d. 



+ ^S^' "'^^'^ {I^/nC! - 2a.R./V^Ci 



(l + fg,(X,))(l + fd. ' 






Now, on the set $G_i := \{\frac{1}{n}d_i \ge \frac{1}{2}c\}$ (which has probability $1 - o(1)$ by the preceding corollary),

\T-Ti\ < —a^\qi{X,)-di\+(- + l] {l/nC! + 2a,/V^\Ci\) + —\d, - qi{Xi)\ {R^ /nCf + 2a^Ri/V^\Ci\) . 
cn \^ J cn 

Using (3) and (4) as well as Cauchy-Schwarz inequality, it follows that 
Bi{\T-Ti\ IgJ <K{c)lai-^bQ^{2;Xi)-^ + -^ + 



^ 1 Rf y/bQ,{4;Xi)y/bL{8;X,) ^ 1 ^^Rj ^hQ,{A-X,) V6l(4;X,) 
Ti? t"^ ^ n t"^ t 

where K{c) denotes a numerical constant which depends on c. Since the right-hand side is deterministic, 
the same bound holds for the unconditional expectation. 

On the complementary set $G_i^c$, we can use the fact that $|T - T_i| \le 1 + a_i^2$ to obtain

$$E\left(|T - T_i|^2\, 1_{G_i^c}\right) \le K\,P(G_i^c) \le KC_0\exp(-c_0n).$$

Summing over i = 1, . . . , n and recalling our assumptions, we conclude that 

n 

J2 E {9{a; X) - E {g{a; X))f = O(n^i) . 
1=1 

Similar considerations can be made for the Lindeberg approach, with the result that

$$|E\left(g(a; X) - g(a; Y)\right)| = O(n^{-1/2}).$$

4 Relevance to statistical problems 

As discussed in the introduction, many quantities of statistical interest can be analyzed using our 
results. We will find deterministic equivalents for them. To keep the presentation readable for readers 
interested more in the applications than in the theory, we do not repeat the assumptions of our theorems. 
So all our statements should be understood as being prefaced: "assuming that the technical conditions laid
out earlier in the paper are satisfied, we have...".

What the reader should essentially know is that shrinking the sample covariance matrix to a deter- 
ministic matrix A has the effect of essentially shrinking a scaled version of the population covariance to 
the same matrix A. The damping factor depends on A and S and is estimable. When the mean is also 
estimated, the results of Subsection 3.3.2 need to be applied. 

Our results show the remarkable robustness of random matrix results - we need very little control over 
the particulars of the data distributions - though they highlight their sensitivity to geometric assumptions. 
We now give a few examples where these computations are relevant and shed light on statistical matters. 



4.1 Estimation issues 

4.1.1 Estimation of $v'(\Sigma + A)^{-1}v$ when $\Sigma$ is not observed directly

The motivation for this kind of question comes from understanding the population behavior of certain 
statistical procedures from observed data and hence deriving benchmarks as to how well a procedure could 
do. This could be used in evaluating a kind of regret, directly from the data. 

Recall our general setting, namely we observe

$$Y_i = \mu + R_iX_i,$$

where $R_i$ are possibly random and $X_i$ are random with distributions satisfying "our usual assumptions"
(see Subsection 3.2). In particular, the $X_i$'s have mean 0. Recall also the notation $S = \frac{1}{n}\sum_{i=1}^n R_i^2X_iX_i'$.
We have shown that we can find a deterministic equivalent to $v'(S + A)^{-1}v$, namely,

$$v'(S + A)^{-1}v \simeq v'(\gamma(A)\Sigma + A)^{-1}v.$$

We first note that since $\hat\Sigma = S - \tilde\mu\tilde\mu'$ (with $\tilde\mu$ the sample mean of the $R_iX_i$),

$$v'(\hat\Sigma + A)^{-1}v = v'(S + A)^{-1}v + \frac{\left(v'(S + A)^{-1}\tilde\mu\right)^2}{1 - \tilde\mu'(S + A)^{-1}\tilde\mu} \simeq v'(S + A)^{-1}v \simeq v'(\gamma(A)\Sigma + A)^{-1}v,$$

as we have seen that $v'(S + A)^{-1}\tilde\mu \to 0$.

Now in certain situations, for instance when we want to estimate the optimal risk of certain statistical
procedures, we will need to estimate $v'(\Sigma + A)^{-1}v$. We now sketch how to come up with an estimator of
this quantity.

Let $t$ be a real in $\mathbb{R}_+$. We clearly have

$$v'(\hat\Sigma + tA)^{-1}v \simeq v'(\gamma(tA)\Sigma + tA)^{-1}v = \frac{1}{\gamma(tA)}\, v'\left(\Sigma + tA/\gamma(tA)\right)^{-1}v.$$

Now recall that

$$\gamma(A) = \frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{1 + \frac{R_i^2}{n}\mathrm{trace}\left(\Sigma(S + A)^{-1}\right)} = \frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{1 + R_i^2\alpha(A)},$$



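and under our assumptions, $\gamma(A)$ has an asymptotically deterministic equivalent. Note that under concentration
assumptions on the $X_i$'s,

$$\frac{X_i'(S_i + A)^{-1}X_i}{n} \simeq \frac{\mathrm{trace}\left(\Sigma(S_i + A)^{-1}\right)}{n},$$

and using the rank-1 update,

$$\frac{R_i^2\, X_i'(S_i + A)^{-1}X_i}{n} = \frac{\frac{1}{n}R_i^2\, X_i'(S + A)^{-1}X_i}{1 - \frac{1}{n}R_i^2\, X_i'(S + A)^{-1}X_i},$$

so we need only invert $(S + A)$ once to compute efficiently all the terms we are interested in. (Of course
in practice, we do not have access to $R_iX_i$, so we will use $Y_i - \hat\mu = R_iX_i - \tilde\mu$. Because $\tilde\mu'(S + A)^{-1}\tilde\mu$ is
of order 1 and we will be dividing everything by $n$, we can neglect this term in this discussion. The same
applies to terms of the form $\tilde\mu'(S + A - \tilde\mu\tilde\mu')^{-1}X_i$.)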
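The "invert once" remark can be illustrated with a short sketch. It assumes NumPy, simulated data with arbitrary illustrative sizes, and that the uncentered vectors $R_iX_i$ are available (in practice they would be replaced by the centered observations, as discussed above). The leave-one-out quantities $\frac{1}{n}R_i^2X_i'(S_i + A)^{-1}X_i$ are computed for all $i$ from a single inverse via the rank-1 identity $h_i/(1 - h_i)$, with $h_i = \frac{1}{n}R_i^2X_i'(S + A)^{-1}X_i$, and compared with a brute-force loop.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 25                                # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
R = rng.uniform(0.5, 2.0, size=n)            # illustrative scale factors
A = 0.7 * np.eye(p)                          # illustrative regularizer

Z = X * R[:, None]                           # rows R_i X_i
S = Z.T @ Z / n
M_inv = np.linalg.inv(S + A)                 # the only p x p inverse computed

# h_i = R_i^2 X_i' (S + A)^{-1} X_i / n for all i at once
h = np.einsum("ij,jk,ik->i", Z, M_inv, Z) / n
# Rank-1 update: R_i^2 X_i' (S_i + A)^{-1} X_i / n = h_i / (1 - h_i)
loo_fast = h / (1.0 - h)

# Brute-force check: one linear solve per left-out observation
loo_slow = np.empty(n)
for i in range(n):
    S_i = S - np.outer(Z[i], Z[i]) / n
    loo_slow[i] = Z[i] @ np.linalg.solve(S_i + A, Z[i]) / n

print(np.max(np.abs(loo_fast - loo_slow)))
```

The fast version costs one inversion plus $O(np^2)$ operations, instead of $n$ separate solves.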

So we can now estimate Rfa{A), and using the fact that 



<^y n^^l + R^aiA)" 

we can also estimate ^{A). 

So to estimate $v'(\Sigma + A)^{-1}v$, all we need to do is find $t_0$ such that

$$\frac{\gamma(t_0A)}{t_0} = 1.$$

We will now show that $\gamma(tA)/t$ is decreasing; hence a simple dichotomous search will yield a fast algorithm
for finding this $t_0$.
We note that

$$\frac{\gamma(tA)}{t} = \frac{1}{n}\sum_{i=1}^n \frac{R_i^2}{t + \frac{R_i^2}{n}\mathrm{trace}\left(\Sigma(S/t + A)^{-1}\right)}.$$

Now $(S/t + A)^{-1}$ is clearly increasing in $t$ in the Loewner order, and hence so is $\mathrm{trace}\left(\Sigma(S/t + A)^{-1}\right)$ since we
are dealing with positive semi-definite matrices. Therefore, $\gamma(tA)/t$ is decreasing.
We note that its limit is 0 at infinity and infinity at 0. Hence the equation

$$\frac{\gamma(tA)}{t} = 1$$

has a unique solution, $t_0$.

We now have found an estimator of $v'(\Sigma + A)^{-1}v$, since

$$v'(\Sigma + A)^{-1}v \simeq t_0\, v'(\hat\Sigma + t_0A)^{-1}v.$$
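The dichotomous search can be sketched as follows. This is an oracle illustration only: it assumes NumPy, Gaussian data with $R_i = 1$, $\Sigma = A = \mathrm{Id}$, and it evaluates $\gamma(tA)$ with the true $\Sigma$, whereas in practice $\gamma$ would be replaced by the data-driven estimate described above. The function names `gamma_over_t` and `solve_t0` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 100                              # illustrative sizes (assumption)
Sigma = np.eye(p)                            # oracle covariance (demo assumption)
A = np.eye(p)
X = rng.standard_normal((n, p))              # R_i = 1 for all i
S = X.T @ X / n
R2 = np.ones(n)


def gamma_over_t(t):
    """gamma(tA)/t, with gamma(A) = (1/n) sum_i R_i^2 / (1 + R_i^2 trace(Sigma (S+A)^{-1}) / n)."""
    alpha = np.trace(Sigma @ np.linalg.inv(S + t * A)) / n
    return np.mean(R2 / (1.0 + R2 * alpha)) / t


def solve_t0(lo=1e-6, hi=1e6, iters=200):
    """Bisection on a log scale; valid because gamma(tA)/t is decreasing in t."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if gamma_over_t(mid) > 1.0:
            lo = mid                         # root lies to the right of mid
        else:
            hi = mid                         # root lies to the left of mid
    return np.sqrt(lo * hi)


t0 = solve_t0()
v = np.zeros(p)
v[0] = 1.0
estimate = t0 * v @ np.linalg.solve(S + t0 * A, v)    # t0 * v'(S + t0 A)^{-1} v
target = v @ np.linalg.solve(Sigma + A, v)            # v'(Sigma + A)^{-1} v
print(t0, estimate, target)
```

In this toy setting the target equals $1/2$, and the estimate should be close to it up to fluctuations of the quadratic form.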

4.2 Classification 

Random matrix techniques offer us insights into the behavior of standard methods in high-dimension. 
Our work could be helpful in tuning regularization parameters, picking penalties etc... because we are 
able to predict performance of the methods, provided our assumptions are met. It is nonetheless clear 
that sometimes (actually many times), some of the quantities we are considering could be evaluated by 
leave-one out methods, which can be implemented efficiently because of rank-1 updates. In that case, our 
analysis has the merit of explaining the behavior of the techniques considered, something that alternative 
estimators (such as cross-validation) do not do. 

A standard technique in classification is linear discriminant analysis. Some analysis in the high- 
dimensional context has already been done (Bickel and Levina (2003)), in a somewhat different direction. 
Here our aim is to explain what creates problems with LDA in high-dimension, even in the Gaussian case, 
and discuss briefly the behavior of Regularized discriminant analysis (RDA) proposed in Friedman (1989). 



4.2.1 A preliminary remark 

In the classification context (see details below), we will often be faced with a situation where a (regu- 
larized) covariance matrix is a pooled estimator of covariance computed from two groups, i.e 

In our context we will assume that the observations in each group have the same mean fn, where /Xj 
may depend on i = 1,2. Assuming that the data is of the form 

= + Rk^k ) 

where Xj, has mean 0, we have for instance 

k=i 

where Xj. have mean 0, and 

1 

^ i=l 

We will naturally encounter forms of the type

$$(\hat\mu_2 - \hat\mu_1)'(\hat\Sigma + A)^{-1}(\hat\mu_2 - \hat\mu_1)$$

and we now explain how to find deterministic equivalents for the limiting behavior of these forms. We note
that $\hat\mu_i = \mu_i + \tilde\mu_i$, so we will have to work out three quantities:

$$(\mu_2 - \mu_1)'(\hat\Sigma + A)^{-1}(\mu_2 - \mu_1), \quad (\mu_2 - \mu_1)'(\hat\Sigma + A)^{-1}(\tilde\mu_2 - \tilde\mu_1) \quad\text{and}\quad (\tilde\mu_2 - \tilde\mu_1)'(\hat\Sigma + A)^{-1}(\tilde\mu_2 - \tilde\mu_1).$$

The first one is simple as it involves a shrunken matrix and deterministic vectors. The other two are a bit
more subtle, since $\hat\Sigma$ and the $\tilde\mu_i$'s interact (the Gaussian case being an exception for obvious reasons).
We call

$$S_1 = \frac{1}{N_1 - 1}\sum_{k=1}^{N_1} R_k^2X_kX_k' \quad\text{and}\quad S_2 = \frac{1}{N_2 - 1}\sum_{k=N_1+1}^{N_1+N_2} R_k^2X_kX_k',$$

and, if $p_i = (N_i - 1)/(N_1 + N_2 - 2)$,

$$S = p_1S_1 + p_2S_2.$$

We note that, more generally we will have, for some $\rho_i$ (for instance $\rho_i = p_iN_i/(N_i - 1)$ if we wish to
preserve unbiasedness),

$$\hat\Sigma + A = S + A - \rho_1\tilde\mu_1\tilde\mu_1' - \rho_2\tilde\mu_2\tilde\mu_2',$$

where

$$S = \frac{1}{N_1 + N_2 - 2}\sum_{i=1}^{N_1+N_2} R_i^2X_iX_i'.$$

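• On $(\tilde\mu_2 - \tilde\mu_1)'(\hat\Sigma + A)^{-1}(\tilde\mu_2 - \tilde\mu_1)$

Using a rank-1 update formula (we could also use a more general version of the Sherman-Woodbury-
Morrison formula), we have

$$\tilde\mu_2'(\hat\Sigma + A)^{-1} = \frac{\tilde\mu_2'(S + A - \rho_1\tilde\mu_1\tilde\mu_1')^{-1}}{1 - \rho_2\,\tilde\mu_2'(S + A - \rho_1\tilde\mu_1\tilde\mu_1')^{-1}\tilde\mu_2}.$$

Therefore, we have in particular,

$$\tilde\mu_2'(\hat\Sigma + A)^{-1}\tilde\mu_2 = \frac{1}{\rho_2}\left(\frac{1}{1 - \rho_2\,\tilde\mu_2'(S + A - \rho_1\tilde\mu_1\tilde\mu_1')^{-1}\tilde\mu_2} - 1\right).$$

We also see by the same token that

$$\tilde\mu_2'(S + A - \rho_1\tilde\mu_1\tilde\mu_1')^{-1}\tilde\mu_2 = \tilde\mu_2'(S + A)^{-1}\tilde\mu_2 + \rho_1\frac{\left(\tilde\mu_2'(S + A)^{-1}\tilde\mu_1\right)^2}{1 - \rho_1\,\tilde\mu_1'(S + A)^{-1}\tilde\mu_1}.$$

Now recall that (see Subsubsection 3.3.2)

$$\tilde\mu_2'(S + A)^{-1}\tilde\mu_1 \simeq 0.$$

So we conclude that

$$\tilde\mu_2'(\hat\Sigma + A)^{-1}\tilde\mu_2 \simeq \frac{\tilde\mu_2'(S + A)^{-1}\tilde\mu_2}{1 - \rho_2\,\tilde\mu_2'(S + A)^{-1}\tilde\mu_2}.$$

Naturally, our work in Subsubsection 3.3.2 allows us to find a deterministic equivalent to

$$\tilde\mu_2'(S + A)^{-1}\tilde\mu_2$$

and so from then we get a deterministic equivalent to $\tilde\mu_2'(\hat\Sigma + A)^{-1}\tilde\mu_2$. Of course, a similar analysis carries
through for $\tilde\mu_1'(\hat\Sigma + A)^{-1}\tilde\mu_1$. To be more precise, if we call $1_i$ the vector that has 1 if $X_k$ is in group $i$ and
0 otherwise, we see that the $a$ that corresponds to $\tilde\mu_2$ is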



and we can apply our formulas. 
We also need to consider 

J1'2{S + A)-^J1, . 

Using the rank-1 update formula, we have 

Jl2{S + A- pipip'^y^Jii 



'P'2{S + A)-^Jxi 



- P2fJ'2i'S + A - pifli^[) V2 



We have already worked out an approximation to the denominator. Now for the numerator, we have 
obviously 



fj,2{S + A - pifiifL^) m 



I - PiJi[{S + A)-^Jli 
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Hence, again, we see that in the asymptotic hmit we consider 



/i2(5 + A- pinii-i'i) V^i - . 



So we conclude that 



• On (/I2 - + A)-V 



The idea is here again to use our rank-1 update formulas. We have 




^2(5 + A - pifiifj.[) ^fj. 



- P2fJ-2{<S + A - piij.ifi[) V2 ' 



We also have 



/i2(5 + A - piflifl[) V 



IJ,2iS + A) V + Pi 



1 - Pill[{S + A)-^fli 



So we conclude that if, for instance 



stays bounded 



We now have all the elements needed to get an asymptotically deterministic approximation to 



(/22-w)'(S + A)-i(/l2-/Ii) • 



4.2.2 LDA: Gaussian case 

We recall the (optimal) setup. Suppose we have two groups (or classes). The observations can come
from group 1 or group 2. In both groups they are $\mathcal{N}(\mu_{1,2}, \Sigma)$. The probability of belonging to group 1 is
$\pi_1$. The question is now: given an observation, how should it be classified?

It is easy and standard to find the optimal rule in the population. Namely, by doing likelihood
computations, one quickly realizes that the optimal classification rule is (Hastie et al. (2009)): classify an
observation $x$ as belonging to Group 2 if

$$x'\Sigma^{-1}(\mu_2 - \mu_1) > \frac{1}{2}(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_2 + \mu_1) + \log(\pi_1/\pi_2).$$

Naturally, in practice, $\Sigma$ and $\mu_1$ and $\mu_2$ need to be estimated. A natural solution is to use the training
data (which is labeled, i.e. we know to which class each observation belongs) to estimate $\mu_1$ and $\mu_2$ and
then use a pooled estimate of covariance for $\Sigma$.

In somewhat more detail, if we have $N_1$ observations that belong to class 1 in our training set, and
$N_2$ that belong to class 2, let us denote by $\hat\mu_1$ and $\hat\mu_2$ the sample means of the observations in group 1 and
group 2. If $\hat\Sigma_1$ and $\hat\Sigma_2$ are the sample covariances in each of these groups, then our estimate of $\Sigma$ is

$$\hat\Sigma = \frac{(N_1 - 1)\hat\Sigma_1 + (N_2 - 1)\hat\Sigma_2}{N_1 + N_2 - 2}.$$

(We will assume in the following discussion that $p < N_1 + N_2 - 2$ so $\hat\Sigma$ is invertible.) It is now natural to
ask the following questions:

1. how does naive LDA perform? 

2. how suboptimal is the naive threshold? 

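3. is it possible to estimate the minimal misclassification rate, even if we cannot find the optimal
direction on which to project a new observation?

Naturally, when a Gaussian vector is projected on a direction $d$, its distribution becomes $\mathcal{N}(\mu'd, d'\Sigma d)$.
If our decision rule is to classify $x$ to Group 2 if $x'd > t$, it is clear that the misclassification rate is, if
$\mu_1(d) = \mu_1'd$, $\mu_2(d) = \mu_2'd$ and $\sigma^2(d) = d'\Sigma d$,

$$\pi_1\left(1 - \Phi\left(\frac{t - \mu_1(d)}{\sigma(d)}\right)\right) + \pi_2\,\Phi\left(\frac{t - \mu_2(d)}{\sigma(d)}\right).$$

A simple computation therefore shows that the optimal threshold is

$$t^* = \frac{\sigma^2(d)}{\mu_2(d) - \mu_1(d)}\log(\pi_1/\pi_2) + \frac{\mu_1(d) + \mu_2(d)}{2}.$$

Hence we have (writing $\sigma$, $\mu_1$, $\mu_2$ for $\sigma(d)$, $\mu_1(d)$, $\mu_2(d)$)

$$\frac{t^* - \mu_1}{\sigma} = \frac{\mu_2 - \mu_1}{2\sigma} + \frac{\sigma}{\mu_2 - \mu_1}\log\frac{\pi_1}{\pi_2}, \qquad \frac{t^* - \mu_2}{\sigma} = -\frac{\mu_2 - \mu_1}{2\sigma} + \frac{\sigma}{\mu_2 - \mu_1}\log\frac{\pi_1}{\pi_2}.$$

We can therefore compute the optimal misclassification rate as

$$\pi_1\left(1 - \Phi\left(\frac{\mu_2 - \mu_1}{2\sigma} + \frac{\sigma}{\mu_2 - \mu_1}\log\frac{\pi_1}{\pi_2}\right)\right) + \pi_2\,\Phi\left(-\frac{\mu_2 - \mu_1}{2\sigma} + \frac{\sigma}{\mu_2 - \mu_1}\log\frac{\pi_1}{\pi_2}\right).$$

Note that in LDA in the population, we have $\mu_2(d) - \mu_1(d) = (\mu_2 - \mu_1)'\Sigma^{-1}(\mu_2 - \mu_1) = \sigma^2$. Hence some
simplifications ensue; in particular, the optimal misclassification rate is, if $\sigma$ is the Mahalanobis distance
between $\mu_2$ and $\mu_1$,

$$\pi_1 - \pi_1\,\Phi\left(\frac{\sigma}{2} + \frac{1}{\sigma}\log(\pi_1/\pi_2)\right) + \pi_2\,\Phi\left(-\frac{\sigma}{2} + \frac{1}{\sigma}\log(\pi_1/\pi_2)\right).$$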
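The threshold and error-rate formulas above are easy to evaluate numerically. The following is a minimal sketch using only the Python standard library; the function names and the numerical values of `mu1`, `sigma` and `pi1` are illustrative assumptions, not quantities from the paper. It computes the optimal threshold for projected means $\mu_1(d) < \mu_2(d)$, evaluates the misclassification rate there, and compares it with the simplified population formula valid when $\mu_2(d) - \mu_1(d) = \sigma^2$.

```python
import math


def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def misclassification_rate(t, mu1, mu2, sigma, pi1):
    """Error of the rule 'assign to Group 2 if x'd > t' (projected means mu1 < mu2)."""
    pi2 = 1.0 - pi1
    return pi1 * (1.0 - Phi((t - mu1) / sigma)) + pi2 * Phi((t - mu2) / sigma)


def optimal_threshold(mu1, mu2, sigma, pi1):
    """t* = sigma^2 log(pi1/pi2) / (mu2 - mu1) + (mu1 + mu2) / 2."""
    pi2 = 1.0 - pi1
    return sigma**2 * math.log(pi1 / pi2) / (mu2 - mu1) + 0.5 * (mu1 + mu2)


def population_lda_rate(sigma, pi1):
    """Optimal rate when mu2(d) - mu1(d) = sigma^2 (sigma = Mahalanobis distance)."""
    pi2 = 1.0 - pi1
    c = math.log(pi1 / pi2) / sigma
    return pi1 - pi1 * Phi(sigma / 2 + c) + pi2 * Phi(-sigma / 2 + c)


mu1, sigma, pi1 = 0.3, 1.5, 0.7              # illustrative numbers (assumption)
mu2 = mu1 + sigma**2                         # population LDA direction
t_star = optimal_threshold(mu1, mu2, sigma, pi1)
rate_star = misclassification_rate(t_star, mu1, mu2, sigma, pi1)
rate_midpoint = misclassification_rate(0.5 * (mu1 + mu2), mu1, mu2, sigma, pi1)
print(t_star, rate_star, rate_midpoint)
```

With unequal priors, the midpoint threshold gives a strictly larger error than $t^*$, which is the effect of the $\log(\pi_1/\pi_2)$ term.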



Hence, our problems reduce to: 



1. Estimate the Mahalanobis distance between $\mu_1$ and $\mu_2$ so we can compute the optimal misclassification
rate for the problem

2. Estimate t* from the data to obtain a procedure that outperforms the naive procedure. 

We note that it is good practice to do cross-validation to estimate t* - and this has been recognized by 
practitioners, see Hastie et al. (2009). However, even when the data is Gaussian, as we show below, a 
correction to the naive empirical threshold is needed in high-dimension. 
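• Estimation of $t^*$. When $d = \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1)$, we have

$$\sigma^2(d) = (\hat\mu_2 - \hat\mu_1)'\hat\Sigma^{-1}\Sigma\hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1),$$
$$\mu_i(d) = \mu_i'\hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1).$$

In the Gaussian case, using properties of Wishart matrices (the interested reader is also referred to El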
Karoui (2009c) for similar computations, but going beyond the Wishart case), we see that, if /? = p/N, 

1 1 



a (d) ~ (/i2 - /Ui)'S {fi2 - ^1) 



+ 



On the other hand, 



1- p 



Now from the data we can get an estimate of {^2 — /fi)'S ^{Jj-2 — Jj-i)- A simple computation, based on 
properties of Wishart matrices (see e.g El Karoui (2009b) for full details) gives: 



{fl2-pi)'^ ^{^2-^1) 



1 



(^2-W)'S-n/^2-/il) + |^ + ^ 



il-p?a\d). 



On the other hand, 



P2T. ^(/22 - /ii) - 
p[T.~^{p2 - pi) ^ 



1 



l-p 
1 



1 



p'^T: ^{P2-Pl) + ^ 

PiT.~^{p2 - /^i) - ]^ 
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So we can estimate pi{d) by 



P) 



where it is 1 for i = 1 and it = — 1 for i = 2. 

We can now estimate $t^*$ by putting together all these estimators. (We note that we could also do this
by using estimates of $\sigma^2(d)$ and $\mu_i(d)$ based on leave-one-out procedures. However, the advantage of the
procedure proposed here is that the amount of extra computation is extremely small, since the corrections
are known in closed form.)

On the other hand, it is clear that the naive threshold value is (in general) suboptimal. As a matter of 
fact, it is approximately 



1 



1 



On the other hand, if maha = (/i2 



- PiYT, ^(/i2 + /il) + 



1 


' p 


p ' 




1-p 


N2 







t* 



1 1 



2 1 



-(Ai2-Aii)'S ^(^2 + /ii) + log(vri/7r2) 



+ log(7ri/7r2) . 



Ni N2J maha 



Let us further remark that when $N_1 = N_2$, because $\log(\pi_1/\pi_2) = 0$, our correction returns exactly the naive
threshold, and hence will not yield improvements. On the other hand, in this situation, the naive threshold
is close to optimal and our analysis shows that further numerical investigation of a good threshold is not
needed.

In other respects, it is rather easy to estimate $(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_2 - \mu_1)$, and hence get the optimal
misclassification rate for any classification procedure, in the case where the data is truly Gaussian. Note
that this is not available by using cross-validation.

Hence, besides shedding light on the potential (limited) problems of LDA in high-dimension, the
computations we showed can be used to establish a benchmark for how well a classification procedure can
perform and perhaps help the user in choosing something better than LDA - or convincing her that LDA
(perhaps corrected) in her context is performing quite well and close to the optimum.



4.2.3 "LDA": elliptical case 

We are now interested in finding a reasonable classification procedure for elliptical data in
high-dimension. We will see that the results obtained in this paper are relevant to shed light on their behavior.
We consider the case here where the $R_i$'s have a smooth density. The data is modeled as

$$Y_i = \mu_{1,2} + R_iX_i.$$

We will focus on the case $X_i \sim \mathcal{N}(0, \Sigma)$, though some of the computations could be carried out in a more
complex situation. Let us call $f$ the density of $R$. The density of

$$Y = \mu_{1,2} + R_iX_i,$$

is, since it is a continuous scale mixture of normals,

exp 

Hence, it is difficult to get an exactly optimal classification rule by using a likelihood method. Nonetheless, 
we can apply Laplace's method to approximate this integral. 

We now recall the model from which X is generated and we see that ^) ^) jg concentrated 

around R^ if (/ii,2 — p)^~^{pi,2 — p) = 0(1). 

We are going to make the assumption that {p2 — /LIi)'S~^(^2 — pi) = 0(1). Calling, for y a dummy 
variable assumed to take values only where X concentrates, 

ap{i) = , 
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we see that ap{i) = 0(1) (indeed ap{i) ~ R^; see the remark on X above) and 

\ap{l) - ap{2)\ = 0{l/p) . 
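Hence applying Laplace's method, we see that, up to a multiplicative factor that does not depend on $i$,
the density $L(y;\mu_i)$ of $y$ under group $i$ satisfies

$$L(y;\mu_i) \simeq f\left(\sqrt{a_p(i)}\right)\exp\left(-\frac{p}{2}\left(\log(a_p(i))+1\right)\right)\sqrt{\pi a_p(i)/p}\;.$$
Hence, under our assumptions, if $\Delta = p(a_p(2)-a_p(1))$ (which is of order 1),

$$\frac{L(y;\mu_1)}{L(y;\mu_2)} \simeq \exp\left(\frac{\Delta}{2a_p(1)}\right)\;.$$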
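The Laplace step above can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: the density of $R$ is taken to be a Gamma(2,1) density, and the values of $p$ and $a_p$ are arbitrary. It compares the integral of $f(r)\,r^{-p}\exp(-p\,a_p/(2r^2))$ with the approximation $f(\sqrt{a_p})\exp(-\frac{p}{2}(\log a_p+1))\sqrt{\pi a_p/p}$; both sides are divided by the value of the exponential factor at its maximum to avoid underflow.

```python
import numpy as np

def f(r):
    # Hypothetical smooth density for R (Gamma with shape 2, scale 1): f(r) = r * exp(-r).
    return r * np.exp(-r)

def laplace_ratio(p, a, grid_size=400_001):
    """Ratio of the numerically integrated quantity to its Laplace approximation.

    The integrand is f(r) * exp(g(r)) with g(r) = -p*log(r) - p*a/(2 r^2),
    which is maximal at r* = sqrt(a). Both sides are divided by exp(g(r*)).
    """
    r_star = np.sqrt(a)
    g = lambda r: -p * np.log(r) - p * a / (2.0 * r**2)
    r = np.linspace(0.05, 6.0, grid_size)
    y = f(r) * np.exp(g(r) - g(r_star))
    exact = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(r))   # trapezoidal rule
    approx = f(r_star) * np.sqrt(np.pi * a / p)
    return exact / approx

ratio = laplace_ratio(p=2000, a=1.3)
print(f"numerical integral / Laplace approximation = {ratio:.4f}")
```

The ratio should approach 1 as $p$ grows, with a relative error of order $1/p$.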



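Now $-\Delta = 2(\mu_2-\mu_1)'\Sigma^{-1}y + \mu_1'\Sigma^{-1}\mu_1 - \mu_2'\Sigma^{-1}\mu_2$. Hence, if the prior probabilities are $\pi_1$ and $\pi_2$, a

reasonable rule for classification appears to be: classify in group 2 if, for a new observation $y$,

$$y'\Sigma^{-1}(\mu_2-\mu_1) > a_p(1)\log\left(\frac{\pi_1}{\pi_2}\right) + \frac{(\mu_2-\mu_1)'\Sigma^{-1}(\mu_2+\mu_1)}{2}\;. \qquad (35)$$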

Here, in what is perhaps a surprise, we see that in high dimension, within the class of elliptical distributions,
a procedure similar to LDA seems quite reasonable.

Under our assumptions, it should be noted that in high dimension, $a_p(1) \simeq R^2$ (for new data generated
according to our model). Therefore this rule is consistent with LDA, since for Gaussian data $R_i^2 = 1$. Now,
if $\|\mu_2-\mu_1\|_2^2 \ll \mathrm{trace}(\Sigma)$, we see that

$$a_p(1) \simeq \frac{\|y-\mu_1\|_2^2}{\mathrm{trace}(\Sigma)}\;,$$

hence, the rule is approximately implementable, though situations where $R_i$ has very heavy tails are likely
to be very hard on these approximations.
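To make rule (35) concrete, here is a minimal simulation sketch. Everything specific in it is an assumption made for illustration: a diagonal $\Sigma$, oracle knowledge of $\mu_1$, $\mu_2$ and $\Sigma$ (no estimation step), priors 0.7 and 0.3, and $R$ uniform on $[0.5, 1.5]$. It applies the rule to simulated elliptical data and also reports how close $a_p(1)$ is to $R^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_test = 200, 4000
pi1, pi2 = 0.7, 0.3                          # prior probabilities (assumed known)

sigma_diag = rng.uniform(0.5, 2.0, size=p)   # Sigma = diag(sigma_diag), illustrative
mu1 = np.zeros(p)
mu2 = mu1 + 2.0 * np.sqrt(sigma_diag / p)    # (mu2-mu1)' Sigma^{-1} (mu2-mu1) = 4

labels = rng.choice([1, 2], size=n_test, p=[pi1, pi2])
R = rng.uniform(0.5, 1.5, size=n_test)       # hypothetical law for the scale variable R
X = rng.standard_normal((n_test, p)) * np.sqrt(sigma_diag)
means = np.where(labels[:, None] == 1, mu1, mu2)
Y = means + R[:, None] * X                   # elliptical observations

def classify(Y):
    """Rule (35) with oracle parameters; returns predicted labels and a_p(1)."""
    a1 = np.sum((Y - mu1) ** 2 / sigma_diag, axis=1) / p
    lhs = (Y / sigma_diag) @ (mu2 - mu1)
    midpoint = 0.5 * np.sum((mu2 - mu1) * (mu2 + mu1) / sigma_diag)
    rhs = a1 * np.log(pi1 / pi2) + midpoint
    return np.where(lhs > rhs, 2, 1), a1

pred, a1 = classify(Y)
accuracy = np.mean(pred == labels)
scale_error = np.mean(np.abs(a1 - R**2))
print(f"accuracy = {accuracy:.3f}, mean |a_p(1) - R^2| = {scale_error:.3f}")
```

In practice the population quantities would be replaced by (corrected) estimators, as discussed in the text; this sketch only illustrates the shape of the rule.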

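Now in the elliptical case, we know (see El Karoui (2009b)) that there exists $s$ such that if $\mathfrak{X}$ is
independent of $\widehat{\Sigma}$, $\widehat{\mu}_1$ and $\widehat{\mu}_2$,

$$\mathfrak{X}'\widehat{\Sigma}^{-1}(\widehat{\mu}_2-\widehat{\mu}_1) \simeq s\,\mathfrak{X}'\Sigma^{-1}(\mu_2-\mu_1)$$

and we can also find an approximation of

$$(\mu_2-\mu_1)'\Sigma^{-1}(\mu_2+\mu_1)$$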



through appropriate corrections, the key computations having been carried out in El Karoui (2009b).
Hence, we can design a classification rule by using (nearly) unbiased estimators of the quantities on both
sides of Equation (35). This could naturally also be done using leave-one-out procedures, though these
procedures would not explain what is happening.

• On changing estimators of covariance
One advantage of the analyses we have carried out is that they reveal (somewhat explicitly) the role played
by the $R_i$'s. Since those are essentially estimable (for instance in the Gaussian case, and in general as soon as we
have measure concentration), we could also envision different weighting schemes, in particular putting all
of them to 1 (which is extremely natural from a convexity standpoint), which amounts to using estimators
which are similar in spirit to Tyler's estimator (see Tyler (1987) and El Karoui (2009b) for more details).
Because the paper is already quite long, we will not seek an optimal procedure here, but our various
estimates (here and in El Karoui (2009b), El Karoui (2009c)) can in principle be used to assess differences
in performance between these estimators of covariance for the statistical tasks at hand.

• Computing the misclassification rate in the elliptical setting 
Suppose we now use a simple threshold rule, similar to LDA, to classify. Though this is suboptimal, 
understanding the behavior of this simple rule is interesting, and helps shed light on various procedures 
and their robustness. 
So suppose we classify an observation $x$ to Group 2 if $x'v > t$. Suppose that $x$ is elliptical and call $f$
the density of $R$. A computation similar to the ones carried out before shows that the misclassification
rate is

$$\pi_1\int \bar{\Phi}\left(\frac{t-\mu_1'v}{r\sqrt{v'\Sigma v}}\right)f(r)\,dr + \pi_2\int \Phi\left(\frac{t-\mu_2'v}{r\sqrt{v'\Sigma v}}\right)f(r)\,dr\;,$$
where $\Phi$ is the standard normal cumulative distribution function and $\bar{\Phi} = 1-\Phi$.

Since we are able to estimate the $\mu_i$'s and $v'\Sigma v$, as well as (at least coarsely) the density $f$, since we can
estimate the $R_i$'s, we can find the optimal threshold $t^*$. This gives a principled alternative to
cross-validation in this case (though leave-one-out techniques could also be used).
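The threshold search described above can be sketched as follows. This is an illustration under assumptions that are not in the text: the density of $R$ is replaced by an empirical sample of (hypothetical) $R_i$ values, the projected means $\mu_1'v$, $\mu_2'v$ and the scale $\sqrt{v'\Sigma v}$ are given illustrative numerical values, the priors are equal, and the optimum is found by a plain grid search. The function name `misclassification_rate` is ours.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))

def misclassification_rate(t, m1, m2, s, r_values, pi1=0.5, pi2=0.5):
    """Error of the rule 'Group 2 if x'v > t'.

    m1, m2   : projected means mu_1'v and mu_2'v
    s        : sqrt(v' Sigma v)
    r_values : sample of (estimated) R_i's, an empirical stand-in for the density f
    """
    r = np.asarray(r_values, dtype=float)
    err1 = np.mean(1.0 - norm_cdf((t - m1) / (r * s)))   # group 1 sent to group 2
    err2 = np.mean(norm_cdf((t - m2) / (r * s)))         # group 2 sent to group 1
    return pi1 * err1 + pi2 * err2

rng = np.random.default_rng(1)
r_values = rng.uniform(0.5, 1.5, size=2000)   # hypothetical R_i's
m1, m2, s = 0.0, 2.0, 1.0                     # illustrative values

grid = np.linspace(m1 - 1.0, m2 + 1.0, 401)
rates = np.array([misclassification_rate(t, m1, m2, s, r_values) for t in grid])
t_star = grid[np.argmin(rates)]
print(f"t* = {t_star:.3f}, rate at t* = {rates.min():.4f}")
```

With equal priors the minimizer sits at the midpoint of the projected means; with unequal priors it moves, and by an amount that depends on the distribution of $R$.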

4.2.4 RDA 

In Friedman (1989), partly motivated by questions having to do with the variability of LDA procedures
(in particular when $\widehat{\Sigma}$ is ill-conditioned), it was proposed to replace $\widehat{\Sigma}$ by

$$\widetilde{\Sigma} = (1-w)\widehat{\Sigma} + wA\;,$$

where $A$ is a matrix towards which $\widehat{\Sigma}$ is shrunken. The computations done in the first part of the paper
allow us to measure the performance of RDA in our asymptotic context.

Our results show that when $w$ varies from 0 to 1, up to a computable scaling factor, forms of the type
$v'\widetilde{\Sigma}^{-1}v$ cover the range of $v'[(1-\lambda)\Sigma+\lambda A]^{-1}v$, for $\lambda$ varying from 0 to 1, though of course $\lambda$ is very different
from $w$ (and $\lambda$ depends on the ellipticity of the data). This property is something that is not immediately
obvious in high dimension. This is valid much beyond the Gaussian design case, as we have shown.
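As a small illustration of the shrinkage step itself (not of the asymptotic results), the sketch below forms $(1-w)\widehat{\Sigma} + wA$ when $p > n$, so that the sample covariance is singular. The target $A = \mathrm{Id}$, the weight $w = 0.2$ and the helper name `shrink` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 100                        # more covariates than observations
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n               # sample covariance (mean taken as known, 0); rank <= n < p

def shrink(Sigma_hat, A, w):
    """Convex combination (1 - w) * Sigma_hat + w * A, for w in [0, 1]."""
    return (1.0 - w) * Sigma_hat + w * A

A = np.eye(p)                         # illustrative shrinkage target
w = 0.2
Sigma_tilde = shrink(Sigma_hat, A, w)

eig_hat = np.linalg.eigvalsh(Sigma_hat)      # sorted in ascending order
eig_tilde = np.linalg.eigvalsh(Sigma_tilde)
print("smallest eigenvalue before / after shrinkage:", eig_hat[0], eig_tilde[0])

v = np.ones(p) / np.sqrt(p)
quad_form = v @ np.linalg.solve(Sigma_tilde, v)   # v' Sigma_tilde^{-1} v
print("v' Sigma_tilde^{-1} v =", quad_form)
```

With $A = \mathrm{Id}$, every eigenvalue of the shrunken matrix is at least $w$, so the quadratic form is well defined even though $\widehat{\Sigma}$ is not invertible.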

Let us now illustrate this in the Gaussian case. In this case, we know how to pick the optimal threshold
at given $w$ and can compute the misclassification rate of the corresponding procedure. Our results also
show that the naive threshold is suboptimal, and suggest corrections, though those can also be found
using leave-one-out procedures that do not rely on our understanding of the phenomena. (This is fairly
similar to our more detailed LDA discussion.)

Our computations also show that one should probably not use 5- or 10-fold cross-validation methods
in high dimension, since they change the ratio $p/n$, which is key in determining and getting optimal
performance.

Here again, a rigorous study of the impact of the $R_i$'s on the quality of classification and the potential benefits
of using robust estimates of scatter is now feasible, but we postpone it to other investigations because of the
length of this paper.

4.3 Optimization problems 

Suppose we consider the optimization problem

$$\begin{cases} \min_w\; w'\Sigma w \\ \text{subject to } V'w = U\;, \end{cases}$$

where $V$ is a $p \times k$ matrix of constraints, and $U$ is a $k \times 1$ vector of values for those constraints. This
is a canonical problem in portfolio optimization (see Meucci (2005), Markowitz (1952)). Under minimal
invertibility conditions, the solution is

$$w_{\text{optimal}} = \Sigma^{-1}VM^{-1}U\;,$$

where $M = V'\Sigma^{-1}V$.
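The closed-form solution can be verified numerically. In the sketch below the covariance matrix, the constraint matrix and the constraint values are arbitrary illustrative choices, and `solve_qp` is a name introduced here. The code checks feasibility, the identity $w'\Sigma w = U'M^{-1}U$ at the optimum, and the fact that another feasible point cannot have a smaller risk.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 30, 2
G = rng.standard_normal((p, p))
Sigma = G @ G.T / p + 0.1 * np.eye(p)       # an arbitrary positive definite covariance
V = np.column_stack([np.ones(p), rng.standard_normal(p)])   # p x k constraint matrix
U = np.array([1.0, 0.5])                    # k constraint values

def solve_qp(Sigma, V, U):
    """Return w = Sigma^{-1} V M^{-1} U together with M = V' Sigma^{-1} V."""
    Sinv_V = np.linalg.solve(Sigma, V)
    M = V.T @ Sinv_V
    w = Sinv_V @ np.linalg.solve(M, U)
    return w, M

w_opt, M = solve_qp(Sigma, V, U)
risk_opt = w_opt @ Sigma @ w_opt

# Any other feasible point is w_opt + z with V'z = 0; its risk cannot be smaller.
z = rng.standard_normal(p)
z -= V @ np.linalg.solve(V.T @ V, V.T @ z)  # project z onto the null space of V'
risk_other = (w_opt + z) @ Sigma @ (w_opt + z)
print(f"optimal risk = {risk_opt:.4f}, risk of another feasible point = {risk_other:.4f}")
```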

Suppose that we estimate $\Sigma$ by $\widetilde{\Sigma} = \lambda\widehat{\Sigma} + A$ and suppose that $V$ contains a constraint involving $\mu$,
which is not known and needs to be estimated. Call $\widehat{w}$ the corresponding solution and $\widehat{M} = V'\widetilde{\Sigma}^{-1}V$.
Then our estimates allow us to get a deterministic equivalent to the naive estimate of the risk, namely
$\widehat{w}'\widetilde{\Sigma}\widehat{w} = U'\widehat{M}^{-1}U$, as well as the true risk of our allocation, i.e. $\widehat{w}'\Sigma\widehat{w}$, at least when the number of constraints
is fixed.
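The three risks discussed here (naive, realized, and oracle) can be compared in a small simulation. The sketch assumes Gaussian data, a diagonal population covariance, a single fixed constraint (weights summing to 1) and the shrinkage matrix $A = 0.5\,\mathrm{Id}$; none of these choices come from the text, and `allocation` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 100                                    # p/n = 0.5
Sigma = np.diag(rng.uniform(0.5, 2.0, size=p))     # illustrative population covariance
X = rng.standard_normal((n, p)) @ np.sqrt(Sigma)   # rows ~ N(0, Sigma); Sigma is diagonal
Sigma_hat = X.T @ X / n
A = 0.5 * np.eye(p)                                # illustrative shrinkage matrix
Sigma_tilde = Sigma_hat + A

V = np.ones((p, 1))                                # single fixed constraint: weights sum to 1
U = np.array([1.0])

def allocation(S):
    """Solution of min w'Sw subject to V'w = U, and the matrix V' S^{-1} V."""
    Sinv_V = np.linalg.solve(S, V)
    M = V.T @ Sinv_V
    return Sinv_V @ np.linalg.solve(M, U), M

w_hat, M_hat = allocation(Sigma_tilde)
w_opt, M = allocation(Sigma)

naive_risk = w_hat @ Sigma_tilde @ w_hat           # risk computed with the estimated covariance
realized_risk = w_hat @ Sigma @ w_hat              # risk actually incurred
optimal_risk = w_opt @ Sigma @ w_opt               # oracle benchmark
print(naive_risk, realized_risk, optimal_risk)
```

The realized risk can never fall below the oracle risk, since the estimated allocation is feasible for the oracle problem; how the naive risk compares with the realized one depends on $p/n$ and on the shrinkage used.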

Let us now be a bit more specific. Suppose $\widetilde{\Sigma} = \widehat{\Sigma} + A$ (scalar constants can easily be dealt with), that
the number of constraints is fixed and $V$ contains only fixed constraints (i.e. nothing needs to be estimated,
and in particular not the mean; this is for instance the case when users in Finance perform minimum
variance optimization, without regard for expected returns). Then,

$$\widehat{M} \simeq V'(\gamma(A)\Sigma + A)^{-1}V = M_A\;,$$

so we get as deterministic equivalent of the naive risk

$$U'\left[V'(\gamma(A)\Sigma + A)^{-1}V\right]^{-1}U\;.$$

The interpretation of this result is that the shrinkage procedure essentially produces an estimator which
is a damped shrinkage estimator, the damping factor being $\gamma(A)$.

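To compute the realized risk, all one needs to do is look at $U'\widehat{M}^{-1}V'(\widehat{\Sigma}+A)^{-1}\Sigma(\widehat{\Sigma}+A)^{-1}V\widehat{M}^{-1}U$.
To understand this, we can just rely on the results of Heuristic 2.2, with $B = \Sigma$. It should be noted that

$$V'(\widehat{\Sigma}+A)^{-1}\Sigma(\widehat{\Sigma}+A)^{-1}V \simeq [1+\xi(A,\Sigma)]\,V'(\gamma(A)\Sigma+A)^{-1}\Sigma(\gamma(A)\Sigma+A)^{-1}V = [1+\xi(A,\Sigma)]\,\mathfrak{M}\;.$$

Hence,

$$U'\widehat{M}^{-1}V'(\widehat{\Sigma}+A)^{-1}\Sigma(\widehat{\Sigma}+A)^{-1}V\widehat{M}^{-1}U \simeq [1+\xi(A,\Sigma)]\,U'M_A^{-1}\mathfrak{M}M_A^{-1}U\;.$$

The situation where $V$ involves $\mu$, and $\mu$ is replaced by $\widehat{\mu}$ in $V$, can be investigated using our results on
quadratic forms in $DX(X'D^2X + A)^{-1}X'D$ and the other results we developed in the paper specifically
for this task.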

Finally, to the reader who might wonder why the study of M~^Ti^M~^ is potentially useful, even in 
the setting where the $\mathfrak{X}_i$ are i.i.d. and hence have the same covariance $\Sigma$, let us give a "practical" example: it
is sometimes the case that in the context of portfolio optimization, one uses log-returns instead of returns
to find the portfolio weights. This is found to be natural when the stock prices follow geometric Brownian
motions, as in the Black-Scholes model. But clearly, in that setting of log-normal prices, the risk exposure
should be computed using the covariance of the returns and not that of the log-returns, two matrices that
are in general different. (Note that our results (and our work on log-normal distributions) also give risk
predictions when using returns instead of log-returns when working with log-normal data.)

4.4 Ridge regression 

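Suppose we consider ridge regression with a general quadratic penalty (a.k.a. Tikhonov regularization).
Then $\widehat{\beta}_{\text{ridge}}$ is found by solving

$$\widehat{\beta}_{\text{ridge}} = \operatorname{argmin}_{\beta}\left\|Y - \frac{1}{\sqrt{n}}X\beta\right\|_2^2 + \lambda\beta'\Gamma\beta\;,$$

where $Y$ is our response, $X$ is the design matrix and $\Gamma$ is a psd matrix. It is easy to verify that

$$\widehat{\beta}_{\text{ridge}} = \frac{1}{\sqrt{n}}\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}X'Y\;.$$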



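Suppose that $Y = \frac{1}{\sqrt{n}}[X\beta_0 + \epsilon]$. Then,

$$\widehat{\beta}_{\text{ridge}} = \left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\left(\frac{X'X}{n}\beta_0 + \frac{X'}{n}\epsilon\right)\;.$$

Hence,

$$\widehat{\beta}_{\text{ridge}} - \beta_0 = -\lambda\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\Gamma\beta_0 + \left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\frac{X'}{n}\epsilon\;.$$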
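The closed form and the error decomposition can be checked on simulated data. The sketch below uses the scaling of the text, $Y = (X\beta_0 + \epsilon)/\sqrt{n}$, with an illustrative diagonal penalty matrix $\Gamma$ and Gaussian design and noise (both are assumptions made only for this example). It verifies that the gradient of the penalized objective vanishes at the closed-form solution and that the error splits into a shrinkage bias term and a noise term.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 80, 40, 0.3
X = rng.standard_normal((n, p))
Gamma = np.diag(rng.uniform(0.5, 1.5, size=p))   # an illustrative psd penalty matrix
beta0 = rng.standard_normal(p)
eps = rng.standard_normal(n)
Y = (X @ beta0 + eps) / np.sqrt(n)

S = X.T @ X / n
K = S + lam * Gamma
beta_ridge = np.linalg.solve(K, X.T @ Y) / np.sqrt(n)

# First-order condition of ||Y - X beta / sqrt(n)||^2 + lam * beta' Gamma beta.
residual = Y - X @ beta_ridge / np.sqrt(n)
gradient = -2.0 * X.T @ residual / np.sqrt(n) + 2.0 * lam * Gamma @ beta_ridge

# Error decomposition: shrinkage bias plus noise term.
bias_term = -lam * np.linalg.solve(K, Gamma @ beta0)
noise_term = np.linalg.solve(K, X.T @ eps / n)
print("gradient norm:", np.linalg.norm(gradient))
```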

The situation where the design is random can now be studied with our tools, provided the assumptions of
our theorems are satisfied.

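For instance if $\epsilon$ has covariance $\Sigma_\epsilon$ and mean 0, we have

$$\mathbf{E}\left(\|\widehat{\beta}_{\text{ridge}}-\beta_0\|_2^2 \,\middle|\, X\right) = \lambda^2\beta_0'\Gamma\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-2}\Gamma\beta_0 + \frac{1}{n}\operatorname{trace}\left(\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\frac{X'\Sigma_\epsilon X}{n}\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\right)\;.$$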
The first quantity can be analyzed using our results in this paper. The second one is comparatively simpler
and comes out of random matrix arguments. For instance, when $\Sigma_\epsilon = \mathrm{Id}_n$, we see that we are left with

$$\operatorname{trace}\left(\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\frac{X'X}{n}\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\right) = \operatorname{trace}\left(\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\right) - \lambda\operatorname{trace}\left(\Gamma\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-2}\right)$$

and these quantities can be analyzed using standard results on Stieltjes transforms (as well as the differentiation
trick we use repeatedly in this paper).
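The conditional risk formula and the trace identity can both be checked by simulation. The sketch below holds a Gaussian design fixed and averages the squared error over many noise draws; it assumes noise with identity covariance, $\Gamma = \mathrm{Id}$, and arbitrary values of $n$, $p$ and $\lambda$, none of which come from the text.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam, reps = 60, 30, 0.5, 20000
X = rng.standard_normal((n, p))
Gamma = np.eye(p)                          # illustrative penalty matrix
beta0 = rng.standard_normal(p) / np.sqrt(p)

S = X.T @ X / n
K_inv = np.linalg.inv(S + lam * Gamma)     # symmetric

# Closed form of the conditional risk when the noise covariance is the identity.
bias_sq = lam**2 * beta0 @ Gamma @ K_inv @ K_inv @ Gamma @ beta0
variance = np.trace(K_inv @ S @ K_inv) / n
formula = bias_sq + variance

# Trace identity: write S = (S + lam*Gamma) - lam*Gamma inside the trace.
trace_lhs = np.trace(K_inv @ S @ K_inv)
trace_rhs = np.trace(K_inv) - lam * np.trace(Gamma @ K_inv @ K_inv)

# Monte Carlo over the noise, with X held fixed.
eps = rng.standard_normal((reps, n))
errors = -lam * K_inv @ Gamma @ beta0 + (eps @ X / n) @ K_inv
monte_carlo = np.mean(np.sum(errors**2, axis=1))
print(f"formula = {formula:.4f}, Monte Carlo = {monte_carlo:.4f}")
```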

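We also note that if $X$ has a symmetric distribution (we could of course relax this assumption with
some work along the lines of what is done in the paper),

$$\frac{1}{n}\mathbf{E}\left(\operatorname{trace}\left(\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\frac{X'X}{n}\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-1}\right)\right) = \mathbf{E}\left(\frac{\mathbf{1}'}{\sqrt{n}}\frac{X}{\sqrt{n}}\left(\frac{X'X}{n}+\lambda\Gamma\right)^{-2}\frac{X'}{\sqrt{n}}\frac{\mathbf{1}}{\sqrt{n}}\right)$$

and we can therefore use the work done in 3.4.2. A similar argument would hold if $\Sigma_\epsilon$ were diagonal, with
$\mathbf{1}$ replaced by the vector $\sigma$ with $\sigma_i^2 = \Sigma_\epsilon(i,i)$.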

The arguments presented in this paper can also be used to understand the quantities $\|\widehat{\beta}_{\text{ridge}}-\beta_0\|_2^2$
directly, before taking expectation, if for instance we have a bound (with high probability) on $\|\epsilon\|$.

Our concentration arguments also allow us to show that 

-E(||^ridgc(x,y)-/3o||i 



has the same limit as the conditional version 



5 Conclusion 

Our study aimed at showing that the tools of random matrix theory could be used to further our 
understanding of various statistical procedures based on shrinkage estimators of covariance. Despite the 
great recent interest in $\ell_1$-type regularizations, these more classical methods are still very useful and
very much in use, which is why we undertook the task of explaining what they actually did (at least 
asymptotically) in high-dimension. We also note that our study has moved us now quite far away from 
"linear" models for the data and we have obtained results for distributions with genuinely non-linear 
structures, something that is very much needed to understand various practical applications. 

We have both shown what we think is a great distributional robustness of random matrix based results 
in this context and a great geometric fragility of those models: distributional assumptions are largely 
irrelevant as long as they have the same geometric implications for the data; when two models yield a 
different geometry, the limiting approximations can change completely. Hence it seems to us that our 
study highlights a basic applied fact: namely users of random matrix results should run diagnostic tests 
before they apply (or rely on) results obtained in Gaussian or Gaussian-like situations (which are the only
ones covered by the "classical" random matrix models). For otherwise, if there is e.g. correlation between
our $n$ observations, or if the geometry of the dataset does not conform to "i.i.d. Gaussian" geometry, naive
random matrix predictions will prove unhelpful and uninformative at best.

On a technical note, our results are quite general, thanks in large part to the approach we used, which 
does not require us to compute the limit (or deterministic equivalent) of various quantities to show it is the 
same when our data come from a wide class of possible distributions. It should be noted that our results 
encompass many distributions for which natural questions in random matrix theory (such as behavior of 
largest and smallest eigenvalues) have not yet been settled or even investigated. In the future, it might also 
be of interest to look into more general estimates of covariance, namely matrix functions of the (shrunken) 
covariance matrix, i.e estimates that apply a certain fixed function to the eigenvalues of the shrunken 
matrix and leave the eigenvectors as is. This seems very approachable by our methods, using Cauchy's 
formula for instance, but because this might be considered a bit less central to multivariate statistics we 
postpone a rigorous study of these questions to a possible future paper. 

APPENDIX 
A A remark on robustness of spectral distributions



This technical appendix is not directly related to the rest of the paper but shows how the methods we 
used can be utilized to analyze the robustness of another quantity of interest in random matrix theory, 
namely the spectral distribution of the matrix. (We put the result here because it fits our theme of 
robustness and is interesting but of course does not warrant its own paper.) 

In El Karoui (2009a), we investigated the robustness properties of generalizations of the Marčenko-Pastur
equation and showed that it held under mild concentration requirements on the data.

As a first step we showed that we could use Azuma's inequality to control the fluctuations of the Stieltjes
transform for a very broad class of distributions. Now to show robustness, all we have to do is show that
the expectation of the Stieltjes transform is the same for all the models we consider. In El Karoui (2009a),
we limited ourselves to models for which the data $\mathfrak{X}_i$ had the same covariance for all $i$. We can now use
similar ideas to the ones we have developed in this paper to do it in a more general case. We call

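$$S_X = \frac{1}{n}\sum_{i=1}^n R_i^2X_iX_i'$$

and the corresponding Stieltjes transform (for $S_X + A$)

$$m_{p,X}(z) = \frac{1}{p}\operatorname{trace}\left((S_X + A - z\mathrm{Id})^{-1}\right)\;,$$

where $A$ is a (deterministic) psd matrix, $z \in \mathbb{C}^+$ and $\mathrm{Im}[z] = v > 0$. We call $u = \mathrm{Re}[z]$.
We have the following theorem.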

Theorem A.1. Under the usual assumptions of this paper (see Subsection 3.2), assuming that the $R_i$'s
are deterministic, and $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ have mean 0 and are such that $\mathrm{cov}(X_i) = \mathrm{cov}(Y_i)$, we have,
for any fixed $z \in \mathbb{C}^+$,

|E (mp,x(z) - m,,Y{z)) I < - ^ ^^Q^ + ^^^(1; Yi)] A - . 

1=1 ^ ' 

This extends some of the results of El Karoui (2009a), since under various concentration assumptions
we will be able to control $b_Q(1;Y_i)$ and $b_Q(1;X_i)$ (recall that when the $X_i$ are Gaussian with covariance
bounded in operator norm, $b_Q(1;X_i)$ is of order $\sqrt{p}$). Note once again that the models considered here
are richer than the ones considered in El Karoui (2009a). The main difference with the results of El Karoui
(2009a) is that this new theorem covers cases where we cannot describe the limit, whereas in El Karoui
(2009a) we described the limit "explicitly".

We refer the reader to El Karoui (2009a) (or Bai and Silverstein (2010)) for details explaining why
showing a.s. convergence of the Stieltjes transform at each $z$ (and a mass preservation condition) gives a.s.
weak convergence of the spectral distribution. Essentially our theorem says that the existence of a limit
needs to be checked only in the Gaussian case and that such a result would transfer over to more general
distributions for which we control $b_Q(1;Y_i)$.
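The kind of robustness described here can be observed numerically. The sketch below compares the Stieltjes transform of $S + A$ for Gaussian entries and for Rademacher entries, which share the same mean and covariance, using the same fixed scale factors $R_i$ in both cases. The matrix $A$, the point $z$, the sizes $n$ and $p$, the law used to draw the $R_i$'s and the function name `stieltjes` are all illustrative assumptions; a single simulation is of course no substitute for the theorem.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 400, 200
z = 1.0 + 1.0j                             # z in C^+, so v = Im[z] = 1
A = 0.5 * np.eye(p)                        # deterministic psd matrix (illustrative)
R = rng.uniform(0.5, 1.5, size=n)          # scale factors, drawn once and then held fixed

def stieltjes(data, R, A, z):
    """trace((S + A - z Id)^{-1}) / p, with S = (1/n) sum_i R_i^2 x_i x_i'."""
    n_obs, dim = data.shape
    scaled = data * R[:, None]
    S = scaled.T @ scaled / n_obs
    eigenvalues = np.linalg.eigvalsh(S + A)
    return np.mean(1.0 / (eigenvalues - z))

X = rng.standard_normal((n, p))                # Gaussian entries
Y = rng.choice([-1.0, 1.0], size=(n, p))       # Rademacher entries: same mean and covariance
m_X = stieltjes(X, R, A, z)
m_Y = stieltjes(Y, R, A, z)
print(m_X, m_Y, abs(m_X - m_Y))
```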

Proof. We go quickly over the details of the proof because we have done many similar ones in the paper. We
take a Lindeberg approach, naturally. It is clear that if $B_j = \frac{1}{n}\sum_{k=1}^{j-1}R_k^2X_kX_k' + \frac{1}{n}\sum_{k=j+1}^{n}R_k^2Y_kY_k' + A$
(with obvious adjustments mentioned in the paper for $j = 1$ and $j = n$), all we have to do is understand




Let us call $B_j(z) = B_j - z\mathrm{Id}$. Note that $B_j$ is psd. By standard rank-1 update arguments, we have

Let us call $d_j(z) = \operatorname{trace}\left(B_j(z)^{-1}\Sigma_j\right)$, where $\Sigma_j$ is the covariance of $X_j$ and $Y_j$. Clearly, since $d_j(z)$ is
independent of $X_j$ and $Y_j$,



E 



So to control |E (Aj 



E 



E 



trace ( B- 



all we have to do is control, if we call $q_j(z) = X_j'B_j^{-1}(z)X_j$,
R] X'^Bj\z)X, R] X'.Bj\z)X, 



E 



The quantity inside the expectation can be rewritten 

0, 



^^(z) - dj{z) 



p2 r2 

Lemma 2.6 in Silverstein and Bai (1995) shows that 

R]X'jBj\z)Xj 



n 



R: 



< 



Hence, |Aj| < 2/v and 



|E {Qj 



I^It;. I \dj{z) - qj{z)\ 



< -^E 

V n 



> 



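By writing $B_j(z)$ in terms of its eigenvalues and eigenvectors, we note that $\mathrm{Im}\left[z\operatorname{trace}\left(B_j^{-1}(z)\Sigma_j\right)\right] \geq 0$
because $B_j$ and $\Sigma_j$ are psd and $z \in \mathbb{C}^+$ (alternatively, $\mathrm{Im}\left[zB_j^{-1}(z)\right]$ is psd). Therefore $\mathrm{Im}[zd_j(z)] \geq 0$.
Hence,

$$\frac{1}{\left|z\left(1+\frac{R_j^2}{n}d_j(z)\right)\right|} \leq \frac{1}{v}\;.$$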



So finally, 



n 



\E{n,)\<^-LB{\d,{z)-q,{z)\) . 



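We now have to analyze $d_j(z) - q_j(z)$. We notice that

$$q_j(z) - d_j(z) = X_j'M_1X_j + iX_j'M_2X_j - \mathbf{E}_j\left(X_j'M_1X_j + iX_j'M_2X_j\right)\;,$$
where if the $a_k$'s are the eigenvectors of $B_j$ and the $\lambda_k$'s its eigenvalues, we have

$$\mathrm{Re}\left[B_j^{-1}(z)\right] = M_1 = \sum_{k=1}^{p}\frac{\lambda_k-u}{(\lambda_k-u)^2+v^2}\,a_ka_k'$$

and

$$\mathrm{Im}\left[B_j^{-1}(z)\right] = M_2 = \sum_{k=1}^{p}\frac{v}{(\lambda_k-u)^2+v^2}\,a_ka_k'\;.$$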

$M_1$ can be written as $M_1 = M_{1,+} - M_{1,-}$, where $M_{1,+}$ is formed by keeping the non-negative eigenvalues of
$M_1$ and replacing the negative ones by 0. Of course, $M_{1,+}$ and $M_{1,-}$ are psd (technically we should index
them by $u$, but we do not do it to alleviate the notation). We now remark that $M_{1,+}$, $M_{1,-}$ and $M_2$ are psd
with $\|M_{1,\pm}\| \leq 1/v$ and $\|M_2\| \leq 1/v$. We can therefore conclude, using the fact that $|z| \leq |\mathrm{Re}[z]| + |\mathrm{Im}[z]|$
as well as the fact that $M_1$ and $M_2$ are independent of $X_j$ that

E{\d,{z)-q,{z)\)<^bQ,il;X,). 
Putting everything together we obtain the result announced in the theorem. □ 
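Several of the deterministic facts used in the proof are easy to verify numerically: the decomposition of the resolvent into its real and imaginary parts, the bounds on their operator norms, the sign of $\mathrm{Im}[z\,d_j(z)]$, and the resulting bound by $1/v$. The sketch below does so for randomly generated psd matrices standing in for $B_j$ and $\Sigma_j$; the constant `c`, which plays the role of $R_j^2/n$, and the values of $u$ and $v$ are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(8)
p = 50
G = rng.standard_normal((p, 2 * p))
B = G @ G.T / (2 * p)                     # a psd matrix playing the role of B_j
H = rng.standard_normal((p, p))
Sigma_j = H @ H.T / p                     # a psd covariance matrix
u, v = 0.7, 0.4
z = u + 1j * v
c = 0.3                                   # stands in for R_j^2 / n >= 0

resolvent = np.linalg.inv(B - z * np.eye(p))    # B_j(z)^{-1}
d = np.trace(resolvent @ Sigma_j)               # d_j(z)

lam, a = np.linalg.eigh(B)                # eigenvalues lam[k], eigenvectors a[:, k]
M1 = (a * ((lam - u) / ((lam - u) ** 2 + v**2))) @ a.T   # real part of the resolvent
M2 = (a * (v / ((lam - u) ** 2 + v**2))) @ a.T           # imaginary part of the resolvent

print("Im[z d] =", (z * d).imag)
print("1/|z(1 + c d)| =", 1.0 / abs(z * (1.0 + c * d)), "<= 1/v =", 1.0 / v)
print("||M1||, ||M2|| =", np.linalg.norm(M1, 2), np.linalg.norm(M2, 2))
```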